--- title: "Gene Set Enrichment Analysis" author: "Robert M Flight" date: "`r Sys.time()`" editor_options: chunk_output_type: console --- ## Introduction `categoryCompare2` was originally designed to work with enrichments generated via *hypergeometric* enrichment, or *over-representation*. However, there are some limitations to that method, some of which can possibly be overcome using *gene-set enrichment analysis*, or GSEA. This vignette shows how to use `categoryCompare2` to work with GSEA enrichments. ## Sample Data To make the concept more concrete, we will examine data from the microarray data set `estrogen` available from Bioconductor. This data set contains 8 samples, with 2 levels of estrogen therapy (present vs absent), and two time points (10 and 48 hours). A pre-processed version of the data is available with this package, the commands used to generate it are below. Note: the preprocessed one keeps only the top 100 genes, if you use it the results will be slightly different than those shown in the vignette. ```{r loadLibs, message=FALSE} library("affy") library("hgu95av2.db") library("genefilter") library("estrogen") library("limma") library("categoryCompare2") library("GO.db") library("org.Hs.eg.db") ``` ```{r loadMeta} datadir <- system.file("extdata", package = "estrogen") pd <- read.AnnotatedDataFrame( file.path(datadir, "estrogen.txt"), header = TRUE, sep = "", row.names = 1 ) pData(pd) ``` Here you can see the descriptions for each of the arrays. First, we will read in the cel files, and then normalize the data using RMA. ```{r loadAffy} currDir <- getwd() setwd(datadir) a <- ReadAffy(filenames = rownames(pData(pd)), phenoData = pd, verbose = TRUE) setwd(currDir) ``` ```{r normalizeAffy, message=FALSE, eval=FALSE} eData <- affy::rma(a) ``` To make it easier to conceptualize, we will split the data up into two eSet objects by time, and perform all of the manipulations for calculating significantly differentially expressed genes on each eSet object. So for the 10 hour samples: ```{r edata10} e_file <- system.file( "extdata/test_data/estrogen_edata.rds", package = "categoryCompare2" ) eData <- readRDS(e_file) e10 <- eData[, eData$time.h == 10] e10 <- nsFilter( e10, remove.dupEntrez = TRUE, var.filter = FALSE, feature.exclude = "^AFFX" )$eset e10$estrogen <- factor(e10$estrogen) d10 <- model.matrix(~ 0 + e10$estrogen) colnames(d10) <- unique(e10$estrogen) fit10 <- lmFit(e10, d10) c10 <- makeContrasts(present - absent, levels = d10) fit10_2 <- contrasts.fit(fit10, c10) eB10 <- eBayes(fit10_2) table10 <- topTable(eB10, number = nrow(e10), p.value = 1, adjust.method = "BH") table10$Entrez <- unlist(mget( rownames(table10), hgu95av2ENTREZID, ifnotfound = NA )) ``` And the 48 hour samples we do the same thing: ```{r edata48} e48 <- eData[, eData$time.h == 48] e48 <- nsFilter( e48, remove.dupEntrez = TRUE, var.filter = FALSE, feature.exclude = "^AFFX" )$eset e48$estrogen <- factor(e48$estrogen) d48 <- model.matrix(~ 0 + e48$estrogen) colnames(d48) <- unique(e48$estrogen) fit48 <- lmFit(e48, d48) c48 <- makeContrasts(present - absent, levels = d48) fit48_2 <- contrasts.fit(fit48, c48) eB48 <- eBayes(fit48_2) table48 <- topTable(eB48, number = nrow(e48), p.value = 1, adjust.method = "BH") table48$Entrez <- unlist(mget( rownames(table48), hgu95av2ENTREZID, ifnotfound = NA )) ``` And grab all the genes on the array to have a background set. For both time points we have generated a list of genes that are differentially expressed in the present vs absent samples. We will calculate GSEA enrichments using `fgsea`, and then compare the enrichments between the two timepoints. ## Create Annotations and Enrich ```{r createGeneList, message=FALSE} bp_annotation = get_db_annotation( "org.Hs.eg.db", features = table10$Entrez, annotation_type = "BP" ) g10_ranks = table10$logFC names(g10_ranks) = table10$Entrez g10_features = new( "gsea_features", ranks = g10_ranks, annotation = bp_annotation ) g10_enrich = gsea_feature_enrichment( g10_features, min_features = 20, max_features = 200 ) g48_ranks = table48$logFC names(g48_ranks) = table48$Entrez g48_features = new( "gsea_features", ranks = g48_ranks, annotation = bp_annotation ) g48_enrich = gsea_feature_enrichment( g48_features, min_features = 20, max_features = 200 ) ``` ## Combine and Find Significant ```{r combine_bp} bp_combined <- combine_enrichments(g10 = g10_enrich, g48 = g48_enrich) ``` ```{r sig_bp} bp_sig <- get_significant_annotations(bp_combined, padjust <= 0.001) bp_sig@statistics@significant ``` ## Generate Graph ```{r graphs} bp_graph <- generate_annotation_graph(bp_sig) bp_graph bp_graph <- remove_edges(bp_graph, 0.8) bp_graph ``` ```{r colors} bp_assign <- annotation_combinations(bp_graph) bp_assign <- assign_colors(bp_assign) ``` ### Find Communities It is useful to define the annotations in terms of their **communities**. To do this we run methods that find and then label the communities, before generating the visualization and table. ```{r find_communities} bp_communities <- assign_communities(bp_graph) bp_comm_labels <- label_communities(bp_communities, bp_annotation) ``` ### Visualize It ```{r bp_visnetwork} bp_network <- graph_to_visnetwork(bp_graph, bp_assign, bp_comm_labels) ``` ```{r out_vis, eval = FALSE} vis_visnetwork(bp_network) ``` ```{r bp_legend, echo = FALSE} generate_legend(bp_assign) ```