1 Overview

scRUtils provides various utilities for visualising and functional analysis of RNA-seq data, particularly single-cell dataset. It evolved from a collection of helper functions that were used in our in-house scRNA-seq processing workflow.

The documentation of this package is divided into 5 sections:

This vignette (#4) will demonstrate functions for markers and DEGs identified from single-cell datasets.

2 Load packages

To use scRUtils and relevant packages in a R session, we load them using the library() command.

library(scRUtils)

library(ggplot2)
library(scater)

3 Markers and DEGs

3.1 Print marker gene stats from `findMarkers` output

The printMarkerStats() function parses the findMarkers() output (from the scran package) and print the number of marker genes that “passed” the fdr or top threshold on to standard output in interactive sessions. It provides a quick way to find out the number of candidate marker genes.

The function can detect the pval.type and test.type, to a certain degree, based on the resulting DataFrame structure. One can optionally provide the pval.type and min.prop settings used to run findMarkers() to allow printMarkerStats() to use them in the printed messages.

The sce demo has four findMarkers() results stored in the metadata slot:

running findMarkers() with pval.type = "any"
running findMarkers() with pval.type = "all"
running findMarkers() with pval.type = "any" and min.prop = 0.5
running findMarkers() with test.type = "wilcox"

3.1.1 With `pval.type = "any"`

data(sce)

# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers1"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers1"]])

## Number of selected markers (Top 200 genes of any 1 comparison):
## - A: 495; Up = 248; Down = 247; Max. P-value = 0.38.
## - B: 522; Up = 250; Down = 272; Max. P-value = 0.37.
## - C: 475; Up = 227; Down = 248; Max. P-value = 0.35.
## - D: 489; Up = 235; Down = 254; Max. P-value = 0.36.
## - E: 506; Up = 245; Down = 261; Max. P-value = 0.19.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.

3.1.2 With `pval.type = "all"`

# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers2"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers2"]])

## Number of selected markers (FDR at 0.05):
## - A: 19; Up = 12; Down = 7; Max. P-value = 0.00046.
## - B: 25; Up = 22; Down = 3; Max. P-value = 0.0011.
## - C: 27; Up = 16; Down = 11; Max. P-value = 0.0012.
## - D: 20; Up = 16; Down = 4; Max. P-value = 0.00051.
## - E: 23; Up = 16; Down = 7; Max. P-value = 0.00069.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.

3.1.3 With `pval.type = "any"` and `min.prop = 0.5`

# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers3"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers3"]], min.prop = 0.5)

## Number of selected markers (Top 200 genes of at least 50.0% comparisons):
## - A: 164; Up = 80; Down = 84; Max. P-value = 0.075.
## - B: 151; Up = 76; Down = 75; Max. P-value = 0.18.
## - C: 171; Up = 82; Down = 89; Max. P-value = 0.18.
## - D: 171; Up = 69; Down = 102; Max. P-value = 0.19.
## - E: 163; Up = 77; Down = 86; Max. P-value = 0.16.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.

3.1.4 With `test.type = "wilcox"`

# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers4"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers4"]])

## Number of selected markers (Top 200 genes of any 1 comparison):
## - A: 155; Up = 80; Down = 75; Max. P-value = 0.0038.
## - B: 166; Up = 54; Down = 112; Max. P-value = 0.0015.
## - C: 162; Up = 43; Down = 119; Max. P-value = 0.0038.
## - D: 152; Up = 66; Down = 86; Max. P-value = 0.0025.
## - E: 147; Up = 63; Down = 84; Max. P-value = 0.0026.
## * Upregulated when AUC > 0.7 and downregulated when AUC < 0.3.

3.2 Retrieve marker genes from a `findMarkers` output DataFrame

The getMarkers() function parses a findMarkers() output DataFrame and returns the up- and down-regulated marker genes that passed the specified threshold.

By specifying direction, the function selects up-regulated markers when direction = "up", down-regulated markers when direction = "down" and both up- and down-regulated markers when direction = "both" (default). The definition of up/down-regulation depends on the test.type used when running findMarkers().

The example below retrieves first 1000 markers (default max) from the 1st DataFrame (Cluster A) in ‘findMarkers1’, return rownames. There will only be 495 genes returned as shown in the printMarkerStats() message.

genes <- getMarkers(metadata(sce)[["findMarkers1"]][[1]])
length(genes)

## [1] 495

head(genes)

## [1] "OOX46" "VWF23" "BFP78" "DKD62" "VVU55" "CBL62"

Next example retrieves first 1000 up-regulated markers (default max) from the 2nd DataFrame (Cluster B) in ‘findMarkers4’, return rownames. There will only be 54 genes returned as shown in the printMarkerStats() message.

genes <- getMarkers(metadata(sce)[["findMarkers4"]][[2]], direction  = "up")
length(genes)

## [1] 54

head(genes)

## [1] "OOX46" "CVH52" "KGI33" "BNU44" "SQS49" "IKP74"

3.3 Retrieve differentially expressed genes from edgeR or DESeq2 analysis

The getDEGs() function parses an object of class TopTags, DGEExact or DGELRT from edgeR or DESeqResults class from DESeq2, and returns the differentially expressed genes (DEGs) that passed the specified thresholds.

As with getMarkers(), by specifying direction, the function selects DEGs with a positive and negative logFC when direction = "both" (default), only positive logFC when direction = "up", and only negative logFC when direction = "down".

The example below retrieves first 250 up-regulated DEGs from DESeq2’s output, return rownames (Ensembl ID).

data(res_deseq2)

genes <- getDEGs(res_deseq2, direction = "up", max = 250)

## Loading required namespace: DESeq2

length(genes)

## [1] 250

head(genes)

## [1] "FBgn0000043" "FBgn0000064" "FBgn0000071" "FBgn0000116" "FBgn0000146"
## [6] "FBgn0000147"

Next example retrieves first 1000 DEGs (default max) at FDR = 10% from edgeR’s output, return values from the external_gene_name column (Gene Symbol).

data(res_edger)

genes <- getDEGs(res_edger, fdr = 0.1, column_by = "external_gene_name")
length(genes)

## [1] 1000

head(genes)

## [1] "Kal1"   "Ant2"   "sesB"   "CG1544" "CG3770" "SPARC"

3.4 Export results from findMarkers, edgeR or DESeq2 to tsv

The printResList() function takes a list of objects containing results from findMarkers(), edgeR or DESeq2, and save the content to text file(s).

When concatenate = TRUE, the function concatenate the test statistics from results stored in the list and outputs a single file.

The example below saves the results from ‘findMarkers1’ into individual files in the per-session temporary directory tempdir().

exportResList(metadata(sce)[["findMarkers1"]], dir_path = tempdir())

## Detecting findMarkers input.
## Creating file: output_findMarkers_A.tsv
## Creating file: output_findMarkers_B.tsv
## Creating file: output_findMarkers_C.tsv
## Creating file: output_findMarkers_D.tsv
## Creating file: output_findMarkers_E.tsv

Next we use concatenate = TRUE save the concatenated results from ‘findMarkers4’ in the per- session temporary directory tempdir().

exportResList(metadata(sce)[["findMarkers4"]], concatenate = TRUE,
              prefix = "wilcox", dir_path = tempdir())

## Detecting findMarkers input.
## Creating a concatenated file: wilcox_findMarkers_concatenated.tsv

4 Enrichment analysis with `enrichR`

4.1 Perform enrichR analysis on results from findMarkers, edgeR or DESeq2

The runEnrichR() function takes a list of objects containing findMarkers(), edgeR or DESeq2 results, and perform enrichment analysis on the genes that passed the specified threshold.

runEnrichR() uses enrichr() from the enrichR package to submit gene symbols for enrichment analysis to the Enrichr website. By default (with column_by = NULL), this function extracts rownames from the input DataFrames and submits the acquired strings for the analysis.

The gene symbols are submitted to the main site (site = "Enrichr") with gene sets compiled from human and mouse genes. By change the site setting, it can use other modEnrichr sites suitable for other model organisms.

In the example below, we construct a list of 3 DataFrames using the same DESeq2 output (res_deseq2) to mimic having a list containing results from multiple sets of comparisons. As pasilla is a RNA-seq dataset from Drosophila melanogaster, we change the site argument to "FlyEnrichr" to carry out the analysis.

# Construct a list of results to run analysis
res.de <- list(A_B = res_deseq2, A_C = res_deseq2, B_C = res_deseq2)

# Select gene-set libraries
dbs <- c("GO_Molecular_Function_2018", "GO_Biological_Process_2018")

# Run enrichR
# Specify the gene symbols are stored in the 'external_gene_name' column
res.ora <- runEnrichR(res.de, dbs = dbs, site = "FlyEnrichr",
                      column_by = "external_gene_name")

## Detecting DESeq2 input.
## Connection changed to https://maayanlab.cloud/FlyEnrichr/
## 
## Connection is Live!
## 
## Running enrichR on 'A_B' with 983 up- and down-regulated genes.

## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.

## Running enrichR on 'A_C' with 983 up- and down-regulated genes.

## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.

## Running enrichR on 'B_C' with 983 up- and down-regulated genes.

## Uploading data to Enrichr... Done.
##   Querying GO_Molecular_Function_2018... Done.
##   Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.

The returned object res.ora is a named list of list of DataFrames:

On the top-level are lists with names from original DESeq2 input, i.e. “A_B, A_C, B_C”.
In each of them are lists with the names of selected gene-set libraries, i.e. “GO_Molecular_Function_2018, GO_Biological_Process_2018”.
Finally, in each is a DataFrame containing the enrichment results from a gene-set libraries.

names(res.ora)

## [1] "A_B" "A_C" "B_C"

names(res.ora[["A_B"]])

## [1] "GO_Molecular_Function_2018" "GO_Biological_Process_2018"

str(res.ora[["A_B"]][["GO_Molecular_Function_2018"]])

## 'data.frame':    435 obs. of  9 variables:
##  $ Term                : chr  "lipid-transporting ATPase activity (GO:0034040)" "sterol-transporting ATPase activity (GO:0034041)" "imaginal disc growth factor receptor binding (GO:0008084)" "anion:anion antiporter activity (GO:0015301)" ...
##  $ Overlap             : chr  "6/10" "6/10" "4/7" "4/8" ...
##  $ P.value             : num  2.46e-06 2.46e-06 1.80e-04 3.46e-04 2.99e-07 ...
##  $ Adjusted.P.value    : num  0.000214 0.000214 0.004693 0.007532 0.000114 ...
##  $ Old.P.value         : num  0.000659 0.000659 0.006344 0.00897 0.000211 ...
##  $ Old.Adjusted.P.value: num  0.0573 0.0573 0.2509 0.3038 0.0573 ...
##  $ Z.score             : num  -3.17 -3.04 -4.23 -4.56 -1.85 ...
##  $ Combined.Score      : num  40.9 39.2 36.4 36.3 27.8 ...
##  $ Genes               : chr  "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "Idgf1;Idgf4;Idgf3;Idgf6" "sesB;Ant2;Prestin;Ndae1" ...

4.2 Visualise Enrichr results as barplots

The plotEnrichR() function takes a list of list of DataFrames containing Enrichr results returned by runEnrichR() and create barplots from a selected gene-set library.

It prints barplots from each group of comparison using grid.draw() from the grid package one after another, therefore the plotting feature is suitable only in Jupyter Notebook or R Markdown. Otherwise, disable plotting by specifying the prefix argument to save barplots to PDF files.

plotEnrichR(res.ora, db = "GO_Biological_Process_2018", theme_size = 14)

Alternatively, we can export barplots to PDF files in the per-session temporary directory tempdir() as shown below.

plotEnrichR(res.ora, db = "GO_Biological_Process_2018", prefix = "Enrichr",
            dir_path = tempdir(), width = 12, height = 5)

## Creating file: Enrichr_A_B_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.pdf

4.3 Export results from Enrichr to tsv

The printEnrichR() function takes a list of list of DataFrames containing Enrichr results returned by runEnrichR() and save all the results in the list object to individual text files.

We are not using printEnrich() from the enrichR package to print results due to a bug not yet fixed in the current verion in CRAN at the time of writing (v3.0).

printEnrichR(res.ora, prefix = "Enrichr", dir_path = tempdir())

## Creating file: Enrichr_A_B_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_B_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_A_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_B_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.tsv

Session information

sessionInfo()

## R version 4.1.3 (2022-03-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/jupyterlab/lib/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.22.0               scuttle_1.4.0              
##  [3] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
##  [5] Biobase_2.54.0              GenomicRanges_1.46.1       
##  [7] GenomeInfoDb_1.30.1         IRanges_2.28.0             
##  [9] S4Vectors_0.32.3            BiocGenerics_0.40.0        
## [11] MatrixGenerics_1.6.0        matrixStats_0.61.0         
## [13] ggplot2_3.3.5               scRUtils_0.1.0             
## [15] BiocStyle_2.22.0           
## 
## loaded via a namespace (and not attached):
##   [1] ggnewscale_0.4.7          ggbeeswarm_0.6.0         
##   [3] colorspace_2.0-3          rjson_0.2.21             
##   [5] ellipsis_0.3.2            bluster_1.4.0            
##   [7] XVector_0.34.0            BiocNeighbors_1.12.0     
##   [9] farver_2.1.0              enrichR_3.0              
##  [11] ggrepel_0.9.1             bit64_4.0.5              
##  [13] AnnotationDbi_1.56.2      fansi_1.0.3              
##  [15] splines_4.1.3             sparseMatrixStats_1.6.0  
##  [17] cachem_1.0.6              geneplotter_1.72.0       
##  [19] knitr_1.38                polyclip_1.10-0          
##  [21] jsonlite_1.8.0            annotate_1.72.0          
##  [23] cluster_2.1.3             png_0.1-7                
##  [25] ggforce_0.3.3             BiocManager_1.30.16      
##  [27] compiler_4.1.3            httr_1.4.2               
##  [29] dqrng_0.3.0               assertthat_0.2.1         
##  [31] Matrix_1.4-1              fastmap_1.1.0            
##  [33] limma_3.50.1              cli_3.2.0                
##  [35] tweenr_1.0.2              BiocSingular_1.10.0      
##  [37] htmltools_0.5.2           tools_4.1.3              
##  [39] rsvd_1.0.5                igraph_1.3.0             
##  [41] gtable_0.3.0              glue_1.6.2               
##  [43] GenomeInfoDbData_1.2.7    dplyr_1.0.8              
##  [45] Rcpp_1.0.8.3              jquerylib_0.1.4          
##  [47] Biostrings_2.62.0         vctrs_0.4.1              
##  [49] DelayedMatrixStats_1.16.0 xfun_0.30                
##  [51] stringr_1.4.0             beachmat_2.10.0          
##  [53] lifecycle_1.0.1           irlba_2.3.5              
##  [55] statmod_1.4.36            XML_3.99-0.9             
##  [57] edgeR_3.36.0              zlibbioc_1.40.0          
##  [59] MASS_7.3-56               scales_1.2.0             
##  [61] parallel_4.1.3            RColorBrewer_1.1-3       
##  [63] curl_4.3.2                yaml_2.3.5               
##  [65] memoise_2.0.1             gridExtra_2.3            
##  [67] sass_0.4.1                stringi_1.7.6            
##  [69] RSQLite_2.2.10            highr_0.9                
##  [71] genefilter_1.76.0         ScaledMatrix_1.2.0       
##  [73] scran_1.22.1              BiocParallel_1.28.3      
##  [75] rlang_1.0.2               pkgconfig_2.0.3          
##  [77] bitops_1.0-7              evaluate_0.15            
##  [79] lattice_0.20-45           purrr_0.3.4              
##  [81] labeling_0.4.2            cowplot_1.1.1            
##  [83] bit_4.0.4                 tidyselect_1.1.2         
##  [85] magrittr_2.0.3            bookdown_0.26            
##  [87] DESeq2_1.34.0             R6_2.5.1                 
##  [89] magick_2.7.3              generics_0.1.2           
##  [91] metapod_1.2.0             DelayedArray_0.20.0      
##  [93] DBI_1.1.2                 pillar_1.7.0             
##  [95] withr_2.5.0               survival_3.3-1           
##  [97] KEGGREST_1.34.0           RCurl_1.98-1.6           
##  [99] tibble_3.1.6              crayon_1.5.1             
## [101] utf8_1.2.2                rmarkdown_2.13           
## [103] viridis_0.6.2             locfit_1.5-9.5           
## [105] grid_4.1.3                blob_1.2.3               
## [107] digest_0.6.29             xtable_1.8-4             
## [109] tidyr_1.2.0               munsell_0.5.0            
## [111] beeswarm_0.4.0            viridisLite_0.4.0        
## [113] vipor_0.4.5               bslib_0.3.1

Utilities for markers and DEGs from scRUtils

04/19/2022

Package

1 Overview

2 Load packages

3 Markers and DEGs

3.1 Print marker gene stats from `findMarkers` output

3.1.1 With `pval.type = "any"`

3.1.2 With `pval.type = "all"`

3.1.3 With `pval.type = "any"` and `min.prop = 0.5`

3.1.4 With `test.type = "wilcox"`

3.2 Retrieve marker genes from a `findMarkers` output DataFrame

3.3 Retrieve differentially expressed genes from edgeR or DESeq2 analysis

3.4 Export results from findMarkers, edgeR or DESeq2 to tsv

4 Enrichment analysis with `enrichR`

4.1 Perform enrichR analysis on results from findMarkers, edgeR or DESeq2

4.2 Visualise Enrichr results as barplots

4.3 Export results from Enrichr to tsv

Session information

Utilities for markers and DEGs from scRUtils

04/19/2022

Package

1 Overview

2 Load packages

3 Markers and DEGs

3.1 Print marker gene stats from findMarkers output

3.1.1 With pval.type = "any"

3.1.2 With pval.type = "all"

3.1.3 With pval.type = "any" and min.prop = 0.5

3.1.4 With test.type = "wilcox"

3.2 Retrieve marker genes from a findMarkers output DataFrame

3.3 Retrieve differentially expressed genes from edgeR or DESeq2 analysis

3.4 Export results from findMarkers, edgeR or DESeq2 to tsv

4 Enrichment analysis with enrichR

4.1 Perform enrichR analysis on results from findMarkers, edgeR or DESeq2

4.2 Visualise Enrichr results as barplots

4.3 Export results from Enrichr to tsv

Session information

3.1 Print marker gene stats from `findMarkers` output

3.1.1 With `pval.type = "any"`

3.1.2 With `pval.type = "all"`

3.1.3 With `pval.type = "any"` and `min.prop = 0.5`

3.1.4 With `test.type = "wilcox"`

3.2 Retrieve marker genes from a `findMarkers` output DataFrame

4 Enrichment analysis with `enrichR`