1 Overview

scRUtils provides various utilities for visualising and functional analysis of RNA-seq data, particularly single-cell dataset. It evolved from a collection of helper functions that were used in our in-house scRNA-seq processing workflow.

The documentation of this package is divided into 5 sections:

  1. Introduction to scRUtils
  2. General data visualisations
  3. Single-cell visualisations
  4. Markers and DEGs
  5. Demo datasets

This vignette (#4) will demonstrate functions for markers and DEGs identified from single-cell datasets.

2 Load packages

To use scRUtils and relevant packages in a R session, we load them using the library() command.

library(scRUtils)

library(ggplot2)
library(scater)

3 Markers and DEGs

3.2 Retrieve marker genes from a findMarkers output DataFrame

The getMarkers() function parses a findMarkers() output DataFrame and returns the up- and down-regulated marker genes that passed the specified threshold.

By specifying direction, the function selects up-regulated markers when direction = "up", down-regulated markers when direction = "down" and both up- and down-regulated markers when direction = "both" (default). The definition of up/down-regulation depends on the test.type used when running findMarkers().

The example below retrieves first 1000 markers (default max) from the 1st DataFrame (Cluster A) in ‘findMarkers1’, return rownames. There will only be 495 genes returned as shown in the printMarkerStats() message.

genes <- getMarkers(metadata(sce)[["findMarkers1"]][[1]])
length(genes)
## [1] 495
head(genes)
## [1] "OOX46" "VWF23" "BFP78" "DKD62" "VVU55" "CBL62"

Next example retrieves first 1000 up-regulated markers (default max) from the 2nd DataFrame (Cluster B) in ‘findMarkers4’, return rownames. There will only be 54 genes returned as shown in the printMarkerStats() message.

genes <- getMarkers(metadata(sce)[["findMarkers4"]][[2]], direction  = "up")
length(genes)
## [1] 54
head(genes)
## [1] "OOX46" "CVH52" "KGI33" "BNU44" "SQS49" "IKP74"

3.3 Retrieve differentially expressed genes from edgeR or DESeq2 analysis

The getDEGs() function parses an object of class TopTags, DGEExact or DGELRT from edgeR or DESeqResults class from DESeq2, and returns the differentially expressed genes (DEGs) that passed the specified thresholds.

As with getMarkers(), by specifying direction, the function selects DEGs with a positive and negative logFC when direction = "both" (default), only positive logFC when direction = "up", and only negative logFC when direction = "down".

The example below retrieves first 250 up-regulated DEGs from DESeq2’s output, return rownames (Ensembl ID).

data(res_deseq2)

genes <- getDEGs(res_deseq2, direction = "up", max = 250)
## Loading required namespace: DESeq2
length(genes)
## [1] 250
head(genes)
## [1] "FBgn0000043" "FBgn0000064" "FBgn0000071" "FBgn0000116" "FBgn0000146"
## [6] "FBgn0000147"

Next example retrieves first 1000 DEGs (default max) at FDR = 10% from edgeR’s output, return values from the external_gene_name column (Gene Symbol).

data(res_edger)

genes <- getDEGs(res_edger, fdr = 0.1, column_by = "external_gene_name")
length(genes)
## [1] 1000
head(genes)
## [1] "Kal1"   "Ant2"   "sesB"   "CG1544" "CG3770" "SPARC"

3.4 Export results from findMarkers, edgeR or DESeq2 to tsv

The printResList() function takes a list of objects containing results from findMarkers(), edgeR or DESeq2, and save the content to text file(s).

When concatenate = TRUE, the function concatenate the test statistics from results stored in the list and outputs a single file.

The example below saves the results from ‘findMarkers1’ into individual files in the per-session temporary directory tempdir().

exportResList(metadata(sce)[["findMarkers1"]], dir_path = tempdir())
## Detecting findMarkers input.
## Creating file: output_findMarkers_A.tsv
## Creating file: output_findMarkers_B.tsv
## Creating file: output_findMarkers_C.tsv
## Creating file: output_findMarkers_D.tsv
## Creating file: output_findMarkers_E.tsv

Next we use concatenate = TRUE save the concatenated results from ‘findMarkers4’ in the per- session temporary directory tempdir().

exportResList(metadata(sce)[["findMarkers4"]], concatenate = TRUE,
              prefix = "wilcox", dir_path = tempdir())
## Detecting findMarkers input.
## Creating a concatenated file: wilcox_findMarkers_concatenated.tsv

4 Enrichment analysis with enrichR

4.1 Perform enrichR analysis on results from findMarkers, edgeR or DESeq2

The runEnrichR() function takes a list of objects containing findMarkers(), edgeR or DESeq2 results, and perform enrichment analysis on the genes that passed the specified threshold.

runEnrichR() uses enrichr() from the enrichR package to submit gene symbols for enrichment analysis to the Enrichr website. By default (with column_by = NULL), this function extracts rownames from the input DataFrames and submits the acquired strings for the analysis.

The gene symbols are submitted to the main site (site = "Enrichr") with gene sets compiled from human and mouse genes. By change the site setting, it can use other modEnrichr sites suitable for other model organisms.

In the example below, we construct a list of 3 DataFrames using the same DESeq2 output (res_deseq2) to mimic having a list containing results from multiple sets of comparisons. As pasilla is a RNA-seq dataset from Drosophila melanogaster, we change the site argument to "FlyEnrichr" to carry out the analysis.

# Construct a list of results to run analysis
res.de <- list(A_B = res_deseq2, A_C = res_deseq2, B_C = res_deseq2)

# Select gene-set libraries
dbs <- c("GO_Molecular_Function_2018", "GO_Biological_Process_2018")

# Run enrichR
# Specify the gene symbols are stored in the 'external_gene_name' column
res.ora <- runEnrichR(res.de, dbs = dbs, site = "FlyEnrichr",
                      column_by = "external_gene_name")
## Detecting DESeq2 input.
## Connection changed to https://maayanlab.cloud/FlyEnrichr/
## 
## Connection is Live!
## 
## Running enrichR on 'A_B' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Running enrichR on 'A_C' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Running enrichR on 'B_C' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.

The returned object res.ora is a named list of list of DataFrames:

  • On the top-level are lists with names from original DESeq2 input, i.e. “A_B, A_C, B_C”.
  • In each of them are lists with the names of selected gene-set libraries, i.e. “GO_Molecular_Function_2018, GO_Biological_Process_2018”.
  • Finally, in each is a DataFrame containing the enrichment results from a gene-set libraries.
names(res.ora)
## [1] "A_B" "A_C" "B_C"
names(res.ora[["A_B"]])
## [1] "GO_Molecular_Function_2018" "GO_Biological_Process_2018"
str(res.ora[["A_B"]][["GO_Molecular_Function_2018"]])
## 'data.frame':    435 obs. of  9 variables:
##  $ Term                : chr  "lipid-transporting ATPase activity (GO:0034040)" "sterol-transporting ATPase activity (GO:0034041)" "imaginal disc growth factor receptor binding (GO:0008084)" "anion:anion antiporter activity (GO:0015301)" ...
##  $ Overlap             : chr  "6/10" "6/10" "4/7" "4/8" ...
##  $ P.value             : num  2.46e-06 2.46e-06 1.80e-04 3.46e-04 2.99e-07 ...
##  $ Adjusted.P.value    : num  0.000214 0.000214 0.004693 0.007532 0.000114 ...
##  $ Old.P.value         : num  0.000659 0.000659 0.006344 0.00897 0.000211 ...
##  $ Old.Adjusted.P.value: num  0.0573 0.0573 0.2509 0.3038 0.0573 ...
##  $ Z.score             : num  -3.17 -3.04 -4.23 -4.56 -1.85 ...
##  $ Combined.Score      : num  40.9 39.2 36.4 36.3 27.8 ...
##  $ Genes               : chr  "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "Idgf1;Idgf4;Idgf3;Idgf6" "sesB;Ant2;Prestin;Ndae1" ...

4.2 Visualise Enrichr results as barplots

The plotEnrichR() function takes a list of list of DataFrames containing Enrichr results returned by runEnrichR() and create barplots from a selected gene-set library.

It prints barplots from each group of comparison using grid.draw() from the grid package one after another, therefore the plotting feature is suitable only in Jupyter Notebook or R Markdown. Otherwise, disable plotting by specifying the prefix argument to save barplots to PDF files.

plotEnrichR(res.ora, db = "GO_Biological_Process_2018", theme_size = 14)

Alternatively, we can export barplots to PDF files in the per-session temporary directory tempdir() as shown below.

plotEnrichR(res.ora, db = "GO_Biological_Process_2018", prefix = "Enrichr",
            dir_path = tempdir(), width = 12, height = 5)
## Creating file: Enrichr_A_B_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.pdf

4.3 Export results from Enrichr to tsv

The printEnrichR() function takes a list of list of DataFrames containing Enrichr results returned by runEnrichR() and save all the results in the list object to individual text files.

We are not using printEnrich() from the enrichR package to print results due to a bug not yet fixed in the current verion in CRAN at the time of writing (v3.0).

printEnrichR(res.ora, prefix = "Enrichr", dir_path = tempdir())
## Creating file: Enrichr_A_B_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_B_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_A_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_B_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.tsv

Session information

sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/jupyterlab/lib/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.22.0               scuttle_1.4.0              
##  [3] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
##  [5] Biobase_2.54.0              GenomicRanges_1.46.1       
##  [7] GenomeInfoDb_1.30.1         IRanges_2.28.0             
##  [9] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [11] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [13] ggplot2_3.3.5               scRUtils_0.1.0             
## [15] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] ggnewscale_0.4.7          ggbeeswarm_0.6.0         
##   [3] colorspace_2.0-3          rjson_0.2.21             
##   [5] ellipsis_0.3.2            bluster_1.4.0            
##   [7] XVector_0.34.0            BiocNeighbors_1.12.0     
##   [9] farver_2.1.0              enrichR_3.0              
##  [11] ggrepel_0.9.1             bit64_4.0.5              
##  [13] AnnotationDbi_1.56.2      fansi_1.0.3              
##  [15] splines_4.1.3             sparseMatrixStats_1.6.0  
##  [17] cachem_1.0.6              geneplotter_1.72.0       
##  [19] knitr_1.38                polyclip_1.10-0          
##  [21] jsonlite_1.8.0            annotate_1.72.0          
##  [23] cluster_2.1.3             png_0.1-7                
##  [25] ggforce_0.3.3             BiocManager_1.30.16      
##  [27] compiler_4.1.3            httr_1.4.2               
##  [29] dqrng_0.3.0               assertthat_0.2.1         
##  [31] Matrix_1.4-1              fastmap_1.1.0            
##  [33] limma_3.50.1              cli_3.2.0                
##  [35] tweenr_1.0.2              BiocSingular_1.10.0      
##  [37] htmltools_0.5.2           tools_4.1.3              
##  [39] rsvd_1.0.5                igraph_1.3.0             
##  [41] gtable_0.3.0              glue_1.6.2               
##  [43] GenomeInfoDbData_1.2.7    dplyr_1.0.8              
##  [45] Rcpp_1.0.8.3              jquerylib_0.1.4          
##  [47] Biostrings_2.62.0         vctrs_0.4.1              
##  [49] DelayedMatrixStats_1.16.0 xfun_0.30                
##  [51] stringr_1.4.0             beachmat_2.10.0          
##  [53] lifecycle_1.0.1           irlba_2.3.5              
##  [55] statmod_1.4.36            XML_3.99-0.9             
##  [57] edgeR_3.36.0              zlibbioc_1.40.0          
##  [59] MASS_7.3-56               scales_1.2.0             
##  [61] parallel_4.1.3            RColorBrewer_1.1-3       
##  [63] curl_4.3.2                yaml_2.3.5               
##  [65] memoise_2.0.1             gridExtra_2.3            
##  [67] sass_0.4.1                stringi_1.7.6            
##  [69] RSQLite_2.2.10            highr_0.9                
##  [71] genefilter_1.76.0         ScaledMatrix_1.2.0       
##  [73] scran_1.22.1              BiocParallel_1.28.3      
##  [75] rlang_1.0.2               pkgconfig_2.0.3          
##  [77] bitops_1.0-7              evaluate_0.15            
##  [79] lattice_0.20-45           purrr_0.3.4              
##  [81] labeling_0.4.2            cowplot_1.1.1            
##  [83] bit_4.0.4                 tidyselect_1.1.2         
##  [85] magrittr_2.0.3            bookdown_0.26            
##  [87] DESeq2_1.34.0             R6_2.5.1                 
##  [89] magick_2.7.3              generics_0.1.2           
##  [91] metapod_1.2.0             DelayedArray_0.20.0      
##  [93] DBI_1.1.2                 pillar_1.7.0             
##  [95] withr_2.5.0               survival_3.3-1           
##  [97] KEGGREST_1.34.0           RCurl_1.98-1.6           
##  [99] tibble_3.1.6              crayon_1.5.1             
## [101] utf8_1.2.2                rmarkdown_2.13           
## [103] viridis_0.6.2             locfit_1.5-9.5           
## [105] grid_4.1.3                blob_1.2.3               
## [107] digest_0.6.29             xtable_1.8-4             
## [109] tidyr_1.2.0               munsell_0.5.0            
## [111] beeswarm_0.4.0            viridisLite_0.4.0        
## [113] vipor_0.4.5               bslib_0.3.1