clusterProfiler: Bioconductor, Paper, Documentation

enrichplot: Bioconductor

Demo Dataset: E-MTAB-8411 from The clock gene Bmal1 inhibits macrophage motility, phagocytosis, and impairs defense against pneumonia. PNAS. 2020;117(3):1543-1551.

License: GPL-3.0

Start R

cd /ngs/GO-Enrichment-Analysis-Demo

R

Load package and set path

library("clusterProfiler")
library("enrichplot")
library("org.Mm.eg.db")
library("ggplot2")

Load data

If you have downloaded the DESeq2_DEG.txt file with wget:

data <- data.table::fread("DESeq2_DEG.txt")
data$GeneID <- substr(data$GeneID, 1, 18)

If you like to donwload the file in R now:

data <- data.table::fread("https://raw.githubusercontent.com/ycl6/GO-Enrichment-Analysis-Demo/master/DESeq2_DEG.txt")
data$GeneID <- substr(data$GeneID, 1, 18)
data
##                    GeneID GeneSymbol      log2fc    pvalue      padj
##     1: ENSMUSG00000000001      Gnai3  0.08804493 0.1925609 0.6146732
##     2: ENSMUSG00000000003       Pbsn          NA        NA        NA
##     3: ENSMUSG00000000028      Cdc45 -0.32106635 0.1401127 0.5331437
##     4: ENSMUSG00000000031        H19 -1.20339889 0.7161464        NA
##     5: ENSMUSG00000000037      Scml2 -0.57746426 0.2979159        NA
##    ---                                                              
## 55381: ENSMUSG00000118636 AC117663.3          NA        NA        NA
## 55382: ENSMUSG00000118637 AL772212.1          NA        NA        NA
## 55383: ENSMUSG00000118638 AL805980.1          NA        NA        NA
## 55384: ENSMUSG00000118639 AL590997.4          NA        NA        NA
## 55385: ENSMUSG00000118640 AC167036.2  0.06415390 0.9208414        NA

Define significance threshold

up.idx <- which(data$padj < 0.05 & data$log2fc > 0) # FDR < 0.05 and logFC > 0
dn.idx <- which(data$padj < 0.05 & data$log2fc < 0) # FDR < 0.05 and logFC < 0
dim(data)
## [1] 55385     5
length(up.idx)
## [1] 383
length(dn.idx)
## [1] 429

Define significant genes

all.genes <- data$GeneSymbol
up.genes <- data[up.idx,]$GeneSymbol
dn.genes <- data[dn.idx,]$GeneSymbol
head(up.genes, 10)
##  [1] "Axin2"  "Hnrnpd" "Kcnn3"  "Mapk7"  "Agpat3" "Sema6b" "Efnb2"  "Il16"   "Ltbp1" 
## [10] "Rgs19"
head(dn.genes, 10)
##  [1] "Cox5a"  "Pdgfb"  "Itga5"  "Cd52"   "Dnmt3l" "Tubb6"  "Ell2"   "Ifrd1"  "Stk38l"
## [10] "Ubl3"

Alternatively, if you only have Ensembl gene ID

all.genes <- data$GeneID
up.genes <- data[up.idx,]$GeneID
dn.genes <- data[dn.idx,]$GeneID

Decide the sub-ontology to test

  • BP: Biological Process
  • CC: Cellular Component
  • MF: Molecular Function
ontology <- "BP"

Set outfile prefix

outTitle <- paste0("clusterProfiler_GO-", ontology, "_ORA_simplify")
outTitle
## [1] "clusterProfiler_GO-BP_ORA_simplify"

Prepare input data

We use all the annotated genes (all.genes) as the “gene universe” and the differentially expressed genes (up.genes and dn.genes) as our genes of interest.

We would need to convert any other identifier format to ENTREZID which is the required input identifier format. This can be done by using the select function from AnnotationDbi that we saw in Part 1 of this demo, or by using the “Biological Id TRanslator” bitr function from clusterProfiler which is a wrapper function of AnnotationDbi::select.

Here, we will use bitr here to see how this can be done.

# Use fromType = "ENSEMBL" if your input identifier is Ensembl gene ID
all.genes.df = bitr(all.genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Mm.eg.db")
## 'select()' returned 1:1 mapping between keys and columns
head(all.genes.df, 10)
##    SYMBOL ENTREZID
## 1   Gnai3    14679
## 2    Pbsn    54192
## 3   Cdc45    12544
## 4     H19    14955
## 5   Scml2   107815
## 6    Apoh    11818
## 7    Narf    67608
## 8    Cav2    12390
## 9    Klf6    23849
## 10  Scmh1    29871
up.genes.df = bitr(up.genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Mm.eg.db")
## 'select()' returned 1:1 mapping between keys and columns
head(up.genes.df, 10)
##    SYMBOL ENTREZID
## 1   Axin2    12006
## 2  Hnrnpd    11991
## 3   Kcnn3   140493
## 4   Mapk7    23939
## 5  Agpat3    28169
## 6  Sema6b    20359
## 7   Efnb2    13642
## 8    Il16    16170
## 9   Ltbp1   268977
## 10  Rgs19    56470
dn.genes.df = bitr(dn.genes, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Mm.eg.db")
## 'select()' returned 1:1 mapping between keys and columns
head(dn.genes.df, 10)
##    SYMBOL ENTREZID
## 1   Cox5a    12858
## 2   Pdgfb    18591
## 3   Itga5    16402
## 4    Cd52    23833
## 5  Dnmt3l    54427
## 6   Tubb6    67951
## 7    Ell2   192657
## 8   Ifrd1    15982
## 9  Stk38l   232533
## 10   Ubl3    24109

Perform enrichment analysis

  • P-value cutoff on enrichment tests is set at 0.05.
  • P-values adjustment method is set to “BH” (Benjamini & Hochberg, same as “fdr”).
  • Q-value cutoff is set at 0.05. Results that must pass (1) pvalueCutoff on unadjusted p-values, (2) pvalueCutoff on adjusted p-values and (3) qvalueCutoff on q-values.
upEGO = enrichGO(gene = up.genes.df$ENTREZID, universe = all.genes.df$ENTREZID, 
         OrgDb = "org.Mm.eg.db", ont = ontology, pvalueCutoff = 0.05, 
         pAdjustMethod = "BH", qvalueCutoff = 0.05, readable = TRUE)
upEGO
## #
## # over-representation test
## #
## #...@organism     Mus musculus 
## #...@ontology     BP 
## #...@keytype      ENTREZID 
## #...@gene     chr [1:370] "12006" "11991" "140493" "23939" "28169" "20359" "13642" "16170" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...80 enriched terms found
## 'data.frame':    80 obs. of  9 variables:
##  $ ID         : chr  "GO:0050900" "GO:0060326" "GO:0050764" "GO:0046394" ...
##  $ Description: chr  "leukocyte migration" "cell chemotaxis" "regulation of phagocytosis" "carboxylic acid biosynthetic process" ...
##  $ GeneRatio  : chr  "25/342" "18/342" "11/342" "17/342" ...
##  $ BgRatio    : chr  "358/22036" "301/22036" "111/22036" "298/22036" ...
##  $ pvalue     : num  0.000000000445 0.000001256598 0.000001273094 0.000004674059 0.000004886976 ...
##  $ p.adjust   : num  0.00000168 0.0016007 0.0016007 0.00296886 0.00296886 ...
##  $ qvalue     : num  0.0000014 0.0013365 0.0013365 0.0024789 0.0024789 ...
##  $ geneID     : chr  "Il16/Lbp/Mmp9/Lgmn/Cd200r1/Alox5/Itgb1/Selp/Itga4/Itga6/Bst1/Coro1a/Itgam/Cadm1/Vav3/Ripor2/Gcnt1/Itga9/Ptger4/"| __truncated__ "Il16/Lbp/Elmo2/Lgmn/Alox5/Bst1/Coro1a/Itgam/Fgfr1/Bcar1/Vav3/Dock4/Ripor2/Itga9/Mtus1/Nod2/Ano6/C5ar2" "Lbp/Alox15/Pot1b/Mfge8/Rab27a/Cd300lf/Nod2/Prtn3/Ano6/Sirpb1c/Gm5150" "Fcer1a/Tha1/Ptgis/Alox15/Acaca/Acly/Mtr/Alox5/Hacd4/Sdsl/Ilvbl/Ccnd3/Dse/Ptger4/Abcd2/Per2/Ldha" ...
##  $ Count      : int  25 18 11 17 17 21 16 14 14 14 ...
## #...Citation
##   Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
##   clusterProfiler: an R package for comparing biological themes among
##   gene clusters. OMICS: A Journal of Integrative Biology
##   2012, 16(5):284-287
dnEGO = enrichGO(gene = dn.genes.df$ENTREZID, universe = all.genes.df$ENTREZID,
                 OrgDb = "org.Mm.eg.db", ont = ontology, pvalueCutoff = 0.05,
                 pAdjustMethod = "BH", qvalueCutoff = 0.05, readable = TRUE)
dnEGO
## #
## # over-representation test
## #
## #...@organism     Mus musculus 
## #...@ontology     BP 
## #...@keytype      ENTREZID 
## #...@gene     chr [1:422] "12858" "18591" "16402" "23833" "54427" "67951" "192657" "15982" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...191 enriched terms found
## 'data.frame':    191 obs. of  9 variables:
##  $ ID         : chr  "GO:0002224" "GO:0042116" "GO:0002221" "GO:0072593" ...
##  $ Description: chr  "toll-like receptor signaling pathway" "macrophage activation" "pattern recognition receptor signaling pathway" "reactive oxygen species metabolic process" ...
##  $ GeneRatio  : chr  "15/389" "13/389" "15/389" "20/389" ...
##  $ BgRatio    : chr  "108/22036" "98/22036" "149/22036" "285/22036" ...
##  $ pvalue     : num  0.000000000759 0.000000018349 0.000000065767 0.000000193465 0.00000027366 ...
##  $ p.adjust   : num  0.000003 0.0000363 0.0000867 0.0001913 0.0002065 ...
##  $ qvalue     : num  0.00000247 0.00002982 0.00007126 0.00015722 0.0001697 ...
##  $ geneID     : chr  "Nr1h3/Cyba/Ptprs/Mapkapk2/Tnip1/Nr1d1/Traf3/Acod1/Tlr2/Lrrfip2/Tnip3/Tlr1/Ticam1/Rab7b/Irak2" "Nr1h3/Nr1d1/Gpr137b/Ifngr2/Trem2/Slc11a1/Tlr2/Slc7a2/Cebpa/Tlr1/Ticam1/Cst7/Syt11" "Nr1h3/Cyba/Ptprs/Mapkapk2/Tnip1/Nr1d1/Traf3/Acod1/Tlr2/Lrrfip2/Tnip3/Tlr1/Ticam1/Rab7b/Irak2" "Pdgfb/Cyba/Nfe2l2/Ncf1/Txnrd1/Acod1/Cdkn1a/Prdx6/Cat/Eif6/Tlr2/Prdx1/Sesn2/Slc25a33/Ndufc2/Slc7a2/Prex1/Shc1/Ticam1/Ass1" ...
##  $ Count      : int  15 13 15 20 24 27 17 22 14 22 ...
## #...Citation
##   Guangchuang Yu, Li-Gen Wang, Yanyan Han and Qing-Yu He.
##   clusterProfiler: an R package for comparing biological themes among
##   gene clusters. OMICS: A Journal of Integrative Biology
##   2012, 16(5):284-287

Reduce GO term redundancy

We then use the simplify function to reduce redundancy of enriched GO terms. We will used the default parameters to run the function.

upSimGO = simplify(upEGO, cutoff = 0.7, by = "p.adjust", select_fun = min, measure = "Wang", 
           semData = NULL)
nrow(upSimGO)
## [1] 51
dnSimGO = simplify(dnEGO, cutoff = 0.7, by = "p.adjust", select_fun = min, measure = "Wang", 
           semData = NULL)
nrow(dnSimGO)
## [1] 103

Plot enrichment

We will use the barplot function from enrichplot (previously a function in clusterProfiler)

Up-regulated genes

png(paste0(outTitle, "_up.png"), width = 9, height = 6, units = "in", res = 300)
barplot(upSimGO, showCategory = 20) + 
    ggtitle(paste0("GO-", ontology," ORA of up-regulated genes")) + 
    xlab("Enriched terms") + ylab("Count")
invisible(dev.off())
"clusterProfiler_GO-BP_ORA_simplify_up.png"

“clusterProfiler_GO-BP_ORA_simplify_up.png”

Down-regulated genes

png(paste0(outTitle, "_dn.png"), width = 9, height = 6, units = "in", res = 300)
barplot(dnSimGO, showCategory = 20) + 
    ggtitle(paste0("GO-", ontology," ORA of down-regulated genes")) +
        xlab("Enriched terms") + ylab("Count")
invisible(dev.off())
"clusterProfiler_GO-BP_ORA_simplify_dn.png"

“clusterProfiler_GO-BP_ORA_simplify_dn.png”

Create result table

Save enrichGO results as a data.frame

up.tab = upSimGO@result
dn.tab = dnSimGO@result

Output results to files

write.table(up.tab, file = paste0(outTitle, "_up.txt"), sep = "\t", quote = F, 
        row.names = F, col.names = T)

write.table(dn.tab, file = paste0(outTitle, "_dn.txt"), sep = "\t", quote = F, 
        row.names = F, col.names = T)

Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4/lib/libopenblasp-r0.3.12.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
## [9] base     
## 
## other attached packages:
##  [1] ggplot2_3.3.3          org.Mm.eg.db_3.12.0    AnnotationDbi_1.52.0  
##  [4] IRanges_2.24.1         S4Vectors_0.28.1       Biobase_2.50.0        
##  [7] BiocGenerics_0.36.0    enrichplot_1.10.2      clusterProfiler_3.18.1
## [10] knitr_1.31            
## 
## loaded via a namespace (and not attached):
##  [1] bit64_4.0.5         RColorBrewer_1.1-2  tools_4.0.3         bslib_0.2.4        
##  [5] utf8_1.1.4          R6_2.5.0            DBI_1.1.1           colorspace_2.0-0   
##  [9] withr_2.4.1         tidyselect_1.1.0    gridExtra_2.3       bit_4.0.4          
## [13] compiler_4.0.3      scatterpie_0.1.5    labeling_0.4.2      shadowtext_0.0.7   
## [17] sass_0.3.1          scales_1.1.1        stringr_1.4.0       digest_0.6.27      
## [21] rmarkdown_2.7       DOSE_3.16.0         pkgconfig_2.0.3     htmltools_0.5.1.1  
## [25] highr_0.8           fastmap_1.1.0       rlang_0.4.10        RSQLite_2.2.3      
## [29] jquerylib_0.1.3     farver_2.1.0        generics_0.1.0      jsonlite_1.7.2     
## [33] BiocParallel_1.24.1 GOSemSim_2.16.1     dplyr_1.0.4         magrittr_2.0.1     
## [37] GO.db_3.12.1        Matrix_1.3-2        Rcpp_1.0.6          munsell_0.5.0      
## [41] fansi_0.4.2         viridis_0.5.1       lifecycle_1.0.0     stringi_1.5.3      
## [45] yaml_2.2.1          ggraph_2.0.5        MASS_7.3-53.1       plyr_1.8.6         
## [49] qvalue_2.22.0       grid_4.0.3          blob_1.2.1          ggrepel_0.9.1      
## [53] DO.db_2.9           crayon_1.4.1        lattice_0.20-41     graphlayouts_0.7.1 
## [57] cowplot_1.1.1       splines_4.0.3       pillar_1.5.0        fgsea_1.16.0       
## [61] igraph_1.2.6        reshape2_1.4.4      fastmatch_1.1-0     glue_1.4.2         
## [65] evaluate_0.14       downloader_0.4      data.table_1.14.0   BiocManager_1.30.10
## [69] png_0.1-7           vctrs_0.3.6         tweenr_1.0.1        gtable_0.3.0       
## [73] purrr_0.3.4         polyclip_1.10-0     tidyr_1.1.3         assertthat_0.2.1   
## [77] cachem_1.0.4        xfun_0.21           ggforce_0.3.2       tidygraph_1.2.0    
## [81] viridisLite_0.3.0   tibble_3.1.0        rvcheck_0.1.8       memoise_2.0.0      
## [85] ellipsis_0.3.1