In addition to using it as part of an R script, there are some
utilities included in {CategoryCompare2} to allow use from
the command line. This is known as a command line
interface, or CLI.
Although it still requires having R and the
{categoryCompare2} package installed, once installed, you
should be able to do a full analysis from the command line.
If you don’t have the package installed yet, make sure you install it and dependent packages.
install.packages("BiocManager")
BiocManager::install("Biobase")
BiocManager::install("remotes")
BiocManager::install(
"moseleybioinformaticslab/categoryCompare2",
dependencies = c("Depends", "Imports", "Suggests")
)If you do have it installed, but haven’t used the CLI before, you
will want to install at least docopt, as
it provides the structure around the CLI.
cli_location = system.file("exec", package = "categoryCompare2")
cli_location
#> [1] "/tmp/RtmpJT1f9F/Rinst1a612c8e672f/categoryCompare2/exec"
dir(cli_location)
#> [1] "categoryCompare2.R" "create_annotations.R" "feature_files_2_json.R"
#> [4] "filter_and_group.R" "run_enrichment.R"
Sys.setenv(CLI_LOCATION = cli_location)Assuming you are on a linux type system, we can add these to the path, and set them to be executable. I’m going to set them here in R, because that’s what I have access to. We show how to do it in the shell as well.
old_path = Sys.getenv("PATH")
new_path = paste0(old_path, ":", cli_location)
Sys.setenv(PATH = new_path)Assuming that we’ve saved the path in a shell variable,
CLI_LOCATION, we can add it to the path, and change the
executables to be executable.
After we make the scripts executable, we can check that we can actually use it.
Now, we will create a directory to put our input files and results.
For hypergeometric enrichment, the {categoryCompare2}
CLI needs a few different pieces of information to work:
For this small example, we are going to use the same estrogen
microarray dataset as in the main vignette. We’ve found the differential
sets of genes for each timepoint, 10 and 48 hours. The set of all genes
measured on the array is in universe_entrez.txt, the 10
hour differential genes are in 10_entrez.txt, and the 48
hour differential genes in 48_entrez.txt.
{knitr} is not good at running sequential bash or shell
commands, so we are going to go through the whole analysis via CLI here,
with some comments, but we will then discuss each one further down, just
without any output.
test_loc = system.file("extdata", "test_data", package = "categoryCompare2")
Sys.setenv(TESTLOC = test_loc)export CURR_DIR=$(pwd)
export WORKING=$CURR_DIR/cc2_1234
mkdir -p $WORKING
cd $WORKING
cp $TESTLOC/10_entrez.txt .
cp $TESTLOC/48_entrez.txt .
cp $TESTLOC/universe_entrez.txt .
# get the annotations from installed organism database
create_annotations.R --orgdb=org.Hs.eg.db --feature-type=ENTREZID \
--annotation-type=GO --json=example_annotations.json
# setup the gene lists
feature_files_2_json.R --file1=10_entrez.txt --file2=48_entrez.txt \
--universe=universe_entrez.txt --json=example_features.json
# do the enrichments
run_enrichment.R --features=example_features.json \
--annotations=example_annotations.json --output-file=example_enrichment.txt
# filter and find communities of related GO terms by shared feature annotations
filter_and_group.R --enrichment-results=example_enrichment.txt \
--p-cutoff=0.01 --count-cutoff=2 --similarity-file=example_similarity.rds --similarity-cutoff=0.8 \
--table-file=example_grouping.txt
# cleanup
rm -rf $WORKING
#> Setting r-universe rmarkdown theme...
#> Loading required namespace: GO.db
#>
#> Setting r-universe rmarkdown theme...
#> Setting r-universe rmarkdown theme...
#> Setting r-universe rmarkdown theme...
#> Significant Annotations:
#> Signficance Cutoffs:
#> counts >= 2
#> padjust <= 0.01
#>
#> Counts:
#> 10_entrez 48_entrez counts
#> G1 1 1 121
#> G2 1 0 143
#> G3 0 1 47
#> G4 0 0 18967
#> Saving annotation similarities in: example_similarity.rds
#> Removed 17784 edges from graphOK, let’s break that down a little more. Annotations are information about the features. For example, for gene products, we have Gene Ontology terms, that describe what pathways those products are involved in (Biological Process), any chemical transformations or binding properties those gene products might do (Molecular Function), and where in a cell or other biological structure they might be (Cellular Component). Other common gene product annotations are also biological pathway membership like those from Kyoto Encyclopedia of Genes and Genomes (KEGG), or pathways from Reactome. Similarly, chemical compounds might be annotated to pathways in KEGG and Reactome.
One of the more common feature annotations are Gene Ontology (GO)
terms. Bioconductor includes GO terms in their organism databases
(org-db), such as org.Hs.eg.db, which is the organism
database for Homo sapiens, indexed by Entrez Gene
(eg).
categoryCompare2 includes a CLI utility for getting the
GO terms from the included org-db into a form that can be used by the
rest of the CLI, create_annotations.R.
The arguments are:
create_annotations.R --orgdb=org.Hs.eg.db --feature-type=ENTREZID \
--annotation-type=GO --json=example_annotations.jsonAlternatively, if you have a source of annotations, you can pass that directly using:
create_annotations.R --input=annotations.json --feature-type=ENTREZID \
--annotation-type=GO --json=example_annotations.jsonFor example, the Moseley Bioinformatics Lab has a python project, gocats, that
enables fuller consideration of term-term relationships in the GO
ontology structure. One of the gocats sub-commands outputs
a json structured file of gene to term mappings that can be used as
input to create_annotations.R,
gocats remap_goterms.
Features are the things we’ve measured. In this example, they are are
genes. They can also be metabolites, chromosomal regions, etc. For
annotation / category enrichment, we need to know what features we
measured (universe), and then what features we are interested in. Part
of the utility of categoryCompare2 is providing the ability
to compare the enrichments from multiple lists.
feature_files_2_json.R --file1=10_entrez.txt --file2=48_entrez.txt \
--universe=universe_entrez.txt --json=example_features.jsonIn this case, we can combine up to four feature lists using the CLI. If you need to combine more than four feature lists, you might want to look at the R API directly, or write some code to create your JSON file directly. The name of the group of features is taken from the file name. The inputs are:
run_enrichment.R --features=example_features.json \
--annotations=example_annotations.json --output-file=example_enrichment.txtThis runs hypergeometric enrichment for each of the feature lists in the output json file from creating features above.
At the very least, there should be the features, annotations, and the output. A full list of options includes:
It’s very important, if you want to do anything further on the CLI
with your results, that you keep --text-only=FALSE. The
next step in the CLI uses an rds file that is generated by
default from the enrichment results. If you use
--text-only=TRUE, then you cannot do Filter and Group Annotations.
Depending on the analysis you want to do, or your needs, then that may
be fine. Just be aware.
Finally, after running the enrichments, we filter to what we think should be significant, and alternatively group GO terms by similarity and look for communities of GO terms.
filter_and_group.R --enrichment-results=example_enrichment.txt \
--p-cutoff=0.01 --count-cutoff=2 --similarity-file=example_similarity.rds \
--similarity-cutoff=0.8 --table-file=example_grouping.txtAlthough this lists the *.txt file as input, it will
actually be looking for a matching rds file to use, which
has all of the information necessary for the filtering and grouping (see
here).
Let’s go through all of the options:
If you do grouping, you definitely want to adjust
similarity-cutoff to be something higher, generally, to
make smaller groups of terms. In my own experience, I’ve found 0.8 is a
good value to use for Gene Ontology terms. For other types of
annotations, you may want to use a higher or lower similarity cutoff.
Unfortunately, the outputs you get here will depend on the values of the
p-cutoff, count-cutoff, and
similarity-cutoff used. However, you can iterate on these
fairly rapidly because the enrichment is already done. In addition, the
annotation similarity network is actually saved as an R rds
file, and as long as the same filename is used, instead of recalculating
the annotation similarities, they will be loaded from the
--similarity-file. This also speeds up the computations
considerably.
Here, we give provide examples of what the input files look like:
# 10_entrez.txt
7083
54898
10551
5111
990
4175
7031
10439
4176
9134
...
# 48_entrez.txt
7083
54898
10551
5111
990
4175
7031
10439
4176
9134
...
# universe.txt
7083
54898
10551
5111
990
4175
7031
10439
4176
9134
These will result in:
# example_features.json
{
"10_entrez": ["7083", "54898", "10551", "5111", "990", "4175", "7031", ...],
"48_entrez": ["7083", "7031", "4605", "7153", "3148", "9768", ...],
"universe": ["7083", "54898", "10551", "5111", "990", "4175", "7031", ...]
}The JSON input for annotations looks like:
{
"7083": ["GO:0005575", "GO:0005575", "GO:0005622", "GO:0005622", ...],
"7031": ["GO:0005575", "GO:0005575", "GO:0005622", "GO:0005623", ...],
...
}After running create_annotations.R, it will look like
this:
Since version 0.100.25, {categoryCompare2} can make use
of GSEA. There are some limitations in the CLI. There are currently no
options to adjust the number minimum and maximum number of annotated
features that an annotation should have to be considered for use in
GSEA, and currently those numbers default to 15 and
500. We do anticipate that those values will be changed
eventually.
For GSEA, the {categoryCompare2} CLI needs a few
different pieces of information to work:
For this small example, we are going to use the same estrogen
microarray dataset as in the main vignette. From the tested sets of
genes for each timepoint, 10 and 48 hours, we can generate a ranked list
of the genes in 10_ranked.txt and
48_ranked.txt.
test_loc = system.file("extdata", "test_data", package = "categoryCompare2")
Sys.setenv(TESTLOC = test_loc)export CURR_DIR=$(pwd)
export WORKING=$CURR_DIR/cc2_1234
mkdir -p $WORKING
cd $WORKING
cp $TESTLOC/10_entrez_ranks.txt .
cp $TESTLOC/48_entrez_ranks.txt .
# get the annotations from installed organism database
create_annotations.R --orgdb=org.Hs.eg.db --feature-type=ENTREZID --annotation-type=GO --json=example_annotations.json
# setup the gene lists
feature_files_2_json.R --file1=10_entrez_ranks.txt --file2=48_entrez_ranks.txt --json=example_features.json
# do the enrichments
run_enrichment.R --features=example_features.json --enrichment-test=gsea --annotations=example_annotations.json --output-file=example_enrichment.txt
# filter and find communities of related GO terms by shared feature annotations
filter_and_group.R --enrichment-results=example_enrichment.txt --p-cutoff=0.01 --similarity-file=example_similarity.rds --similarity-cutoff=0.8 --table-file=example_grouping.txt
# cleanup
rm -rf $WORKING
#> Setting r-universe rmarkdown theme...
#> Loading required namespace: GO.db
#>
#> Setting r-universe rmarkdown theme...
#> No "universe" provided, assuming ranked features.
#>
#> Setting r-universe rmarkdown theme...
#> Warning messages:
#> 1: In prepareStats(stats, scoreType, gseaParam) :
#> There are ties in the preranked stats (39.03% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.
#> 2: In prepareStats(stats, scoreType, gseaParam) :
#> There are ties in the preranked stats (34.78% of the list).
#> The order of those tied genes will be arbitrary, which may produce unexpected results.
#> Setting r-universe rmarkdown theme...
#> Significant Annotations:
#> Signficance Cutoffs:
#> padjust <= 0.01
#>
#> Counts:
#> 10_entrez_ranks 48_entrez_ranks counts
#> G1 1 1 312
#> G2 1 0 71
#> G3 0 1 372
#> G4 0 0 4230
#> Saving annotation similarities in: example_similarity.rds
#> Removed 104682 edges from graphNote that the ranked_entrez.txt has
tabs (“) in between the columns.
# 10_ranked_entrez.txt
"feature" "rank"
"7083" 3.11373254358578
"54898" 2.93942770961125
"10551" 2.93219974980482
"5111" 2.79939639438318
"990" 2.66225839036303
"4175" 2.55528186846043
"7031" 2.8001945355083
"10439" 2.49153560769811
"4176" 2.30283141893498
"9134" 2.22390190377239
"2237" 2.20132873385815
This gets transformed to JSON with two lists, one for the “feature” and one for the “rank”: