scRUtils 0.1.0
scRUtils
provides various utilities for visualising and functional analysis of RNA-seq data,
particularly single-cell dataset. It evolved from a collection of helper functions that were
used in our in-house scRNA-seq processing workflow.
The documentation of this package is divided into 5 sections:
This vignette (#4) will demonstrate functions for markers and DEGs identified from single-cell datasets.
To use scRUtils
and relevant packages in a R session, we load them using the library()
command.
library(scRUtils)
library(ggplot2)
library(scater)
findMarkers
outputThe printMarkerStats()
function parses the findMarkers()
output (from the scran
package) and print the number of marker genes that “passed” the fdr
or top
threshold on to
standard output in interactive sessions. It provides a quick way to find out the number of
candidate marker genes.
The function can detect the pval.type
and test.type
, to a certain degree, based on the
resulting DataFrame structure. One can optionally provide the pval.type
and min.prop
settings used to run findMarkers()
to allow printMarkerStats()
to use them in the printed
messages.
The sce
demo has four findMarkers()
results stored in the metadata
slot:
findMarkers()
with pval.type = "any"
findMarkers()
with pval.type = "all"
findMarkers()
with pval.type = "any"
and min.prop = 0.5
findMarkers()
with test.type = "wilcox"
pval.type = "any"
data(sce)
# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers1"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers1"]])
## Number of selected markers (Top 200 genes of any 1 comparison):
## - A: 495; Up = 248; Down = 247; Max. P-value = 0.38.
## - B: 522; Up = 250; Down = 272; Max. P-value = 0.37.
## - C: 475; Up = 227; Down = 248; Max. P-value = 0.35.
## - D: 489; Up = 235; Down = 254; Max. P-value = 0.36.
## - E: 506; Up = 245; Down = 261; Max. P-value = 0.19.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.
pval.type = "all"
# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers2"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers2"]])
## Number of selected markers (FDR at 0.05):
## - A: 19; Up = 12; Down = 7; Max. P-value = 0.00046.
## - B: 25; Up = 22; Down = 3; Max. P-value = 0.0011.
## - C: 27; Up = 16; Down = 11; Max. P-value = 0.0012.
## - D: 20; Up = 16; Down = 4; Max. P-value = 0.00051.
## - E: 23; Up = 16; Down = 7; Max. P-value = 0.00069.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.
pval.type = "any"
and min.prop = 0.5
# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers3"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers3"]], min.prop = 0.5)
## Number of selected markers (Top 200 genes of at least 50.0% comparisons):
## - A: 164; Up = 80; Down = 84; Max. P-value = 0.075.
## - B: 151; Up = 76; Down = 75; Max. P-value = 0.18.
## - C: 171; Up = 82; Down = 89; Max. P-value = 0.18.
## - D: 171; Up = 69; Down = 102; Max. P-value = 0.19.
## - E: 163; Up = 77; Down = 86; Max. P-value = 0.16.
## * Upregulated when logFC > 0.0 and downregulated when logFC < 0.0.
test.type = "wilcox"
# Show 1st DataFrame in the list
# DataFrame(metadata(sce)[["findMarkers4"]][[1]])
printMarkerStats(metadata(sce)[["findMarkers4"]])
## Number of selected markers (Top 200 genes of any 1 comparison):
## - A: 155; Up = 80; Down = 75; Max. P-value = 0.0038.
## - B: 166; Up = 54; Down = 112; Max. P-value = 0.0015.
## - C: 162; Up = 43; Down = 119; Max. P-value = 0.0038.
## - D: 152; Up = 66; Down = 86; Max. P-value = 0.0025.
## - E: 147; Up = 63; Down = 84; Max. P-value = 0.0026.
## * Upregulated when AUC > 0.7 and downregulated when AUC < 0.3.
findMarkers
output DataFrameThe getMarkers()
function parses a findMarkers()
output DataFrame and returns the up- and
down-regulated marker genes that passed the specified threshold.
By specifying direction
, the function selects up-regulated markers when direction = "up"
,
down-regulated markers when direction = "down"
and both up- and down-regulated markers when
direction = "both"
(default). The definition of up/down-regulation depends on the test.type
used when running findMarkers()
.
The example below retrieves first 1000 markers (default max
) from the 1st DataFrame
(Cluster A) in ‘findMarkers1’, return rownames. There will only be
495 genes returned as shown in the
printMarkerStats()
message.
genes <- getMarkers(metadata(sce)[["findMarkers1"]][[1]])
length(genes)
## [1] 495
head(genes)
## [1] "OOX46" "VWF23" "BFP78" "DKD62" "VVU55" "CBL62"
Next example retrieves first 1000 up-regulated markers (default max
) from the 2nd DataFrame
(Cluster B) in ‘findMarkers4’, return rownames. There will only be
54 genes returned
as shown in the printMarkerStats()
message.
genes <- getMarkers(metadata(sce)[["findMarkers4"]][[2]], direction = "up")
length(genes)
## [1] 54
head(genes)
## [1] "OOX46" "CVH52" "KGI33" "BNU44" "SQS49" "IKP74"
The getDEGs()
function parses an object of class TopTags
, DGEExact
or DGELRT
from
edgeR or DESeqResults
class from DESeq2, and returns the
differentially expressed genes (DEGs) that passed the specified thresholds.
As with getMarkers()
, by specifying direction
, the function selects DEGs with a positive and
negative logFC when direction = "both"
(default), only positive logFC when direction = "up"
,
and only negative logFC when direction = "down"
.
The example below retrieves first 250 up-regulated DEGs from DESeq2’s output, return rownames (Ensembl ID).
data(res_deseq2)
genes <- getDEGs(res_deseq2, direction = "up", max = 250)
## Loading required namespace: DESeq2
length(genes)
## [1] 250
head(genes)
## [1] "FBgn0000043" "FBgn0000064" "FBgn0000071" "FBgn0000116" "FBgn0000146"
## [6] "FBgn0000147"
Next example retrieves first 1000 DEGs (default max
) at FDR = 10% from edgeR’s output,
return values from the external_gene_name
column (Gene Symbol).
data(res_edger)
genes <- getDEGs(res_edger, fdr = 0.1, column_by = "external_gene_name")
length(genes)
## [1] 1000
head(genes)
## [1] "Kal1" "Ant2" "sesB" "CG1544" "CG3770" "SPARC"
The printResList()
function takes a list of objects containing results from findMarkers()
,
edgeR or DESeq2, and save the content to text file(s).
When concatenate = TRUE
, the function concatenate the test statistics from results stored
in the list and outputs a single file.
The example below saves the results from ‘findMarkers1’ into individual files in the per-session
temporary directory tempdir()
.
exportResList(metadata(sce)[["findMarkers1"]], dir_path = tempdir())
## Detecting findMarkers input.
## Creating file: output_findMarkers_A.tsv
## Creating file: output_findMarkers_B.tsv
## Creating file: output_findMarkers_C.tsv
## Creating file: output_findMarkers_D.tsv
## Creating file: output_findMarkers_E.tsv
Next we use concatenate = TRUE
save the concatenated results from ‘findMarkers4’ in the per-
session temporary directory tempdir()
.
exportResList(metadata(sce)[["findMarkers4"]], concatenate = TRUE,
prefix = "wilcox", dir_path = tempdir())
## Detecting findMarkers input.
## Creating a concatenated file: wilcox_findMarkers_concatenated.tsv
enrichR
The runEnrichR()
function takes a list of objects containing findMarkers()
,
edgeR or DESeq2 results, and perform enrichment analysis on
the genes that passed the specified threshold.
runEnrichR()
uses enrichr()
from the enrichR package to submit gene symbols
for enrichment analysis to the Enrichr website. By default
(with column_by = NULL
), this function extracts rownames from the input DataFrames and
submits the acquired strings for the analysis.
The gene symbols are submitted to the main site (site = "Enrichr"
) with gene sets
compiled from human and mouse genes. By change the site
setting, it can use other
modEnrichr sites suitable for other model organisms.
In the example below, we construct a list of 3 DataFrames using the same DESeq2 output
(res_deseq2
) to mimic having a list containing results from multiple sets of comparisons.
As pasilla
is a RNA-seq dataset from Drosophila melanogaster, we change the site
argument to "FlyEnrichr"
to carry out the analysis.
# Construct a list of results to run analysis
res.de <- list(A_B = res_deseq2, A_C = res_deseq2, B_C = res_deseq2)
# Select gene-set libraries
dbs <- c("GO_Molecular_Function_2018", "GO_Biological_Process_2018")
# Run enrichR
# Specify the gene symbols are stored in the 'external_gene_name' column
res.ora <- runEnrichR(res.de, dbs = dbs, site = "FlyEnrichr",
column_by = "external_gene_name")
## Detecting DESeq2 input.
## Connection changed to https://maayanlab.cloud/FlyEnrichr/
##
## Connection is Live!
##
## Running enrichR on 'A_B' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Running enrichR on 'A_C' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
## Running enrichR on 'B_C' with 983 up- and down-regulated genes.
## Uploading data to Enrichr... Done.
## Querying GO_Molecular_Function_2018... Done.
## Querying GO_Biological_Process_2018... Done.
## Parsing results... Done.
The returned object res.ora
is a named list of list of DataFrames:
names(res.ora)
## [1] "A_B" "A_C" "B_C"
names(res.ora[["A_B"]])
## [1] "GO_Molecular_Function_2018" "GO_Biological_Process_2018"
str(res.ora[["A_B"]][["GO_Molecular_Function_2018"]])
## 'data.frame': 435 obs. of 9 variables:
## $ Term : chr "lipid-transporting ATPase activity (GO:0034040)" "sterol-transporting ATPase activity (GO:0034041)" "imaginal disc growth factor receptor binding (GO:0008084)" "anion:anion antiporter activity (GO:0015301)" ...
## $ Overlap : chr "6/10" "6/10" "4/7" "4/8" ...
## $ P.value : num 2.46e-06 2.46e-06 1.80e-04 3.46e-04 2.99e-07 ...
## $ Adjusted.P.value : num 0.000214 0.000214 0.004693 0.007532 0.000114 ...
## $ Old.P.value : num 0.000659 0.000659 0.006344 0.00897 0.000211 ...
## $ Old.Adjusted.P.value: num 0.0573 0.0573 0.2509 0.3038 0.0573 ...
## $ Z.score : num -3.17 -3.04 -4.23 -4.56 -1.85 ...
## $ Combined.Score : num 40.9 39.2 36.4 36.3 27.8 ...
## $ Genes : chr "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "CG5853;CG3164;CG9663;CG31689;CG4822;Atet" "Idgf1;Idgf4;Idgf3;Idgf6" "sesB;Ant2;Prestin;Ndae1" ...
The plotEnrichR()
function takes a list of list of DataFrames containing Enrichr results
returned by runEnrichR()
and create barplots from a selected gene-set library.
It prints barplots from each group of comparison using grid.draw()
from the grid
package one after another, therefore the plotting feature is suitable only in Jupyter Notebook or
R Markdown. Otherwise, disable plotting by specifying the prefix
argument to save barplots to
PDF files.
plotEnrichR(res.ora, db = "GO_Biological_Process_2018", theme_size = 14)
Alternatively, we can export barplots to PDF files in the per-session temporary directory
tempdir()
as shown below.
plotEnrichR(res.ora, db = "GO_Biological_Process_2018", prefix = "Enrichr",
dir_path = tempdir(), width = 12, height = 5)
## Creating file: Enrichr_A_B_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.pdf
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.pdf
The printEnrichR()
function takes a list of list of DataFrames containing Enrichr results
returned by runEnrichR()
and save all the results in the list object to individual text files.
We are not using printEnrich()
from the enrichR package to print results due
to a bug not yet fixed in the current verion in CRAN at the time of writing (v3.0).
printEnrichR(res.ora, prefix = "Enrichr", dir_path = tempdir())
## Creating file: Enrichr_A_B_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_B_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_A_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_A_C_GO_Biological_Process_2018.tsv
## Creating file: Enrichr_B_C_GO_Molecular_Function_2018.tsv
## Creating file: Enrichr_B_C_GO_Biological_Process_2018.tsv
sessionInfo()
## R version 4.1.3 (2022-03-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.4 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/jupyterlab/lib/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_GB.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.22.0 scuttle_1.4.0
## [3] SingleCellExperiment_1.16.0 SummarizedExperiment_1.24.0
## [5] Biobase_2.54.0 GenomicRanges_1.46.1
## [7] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [9] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [11] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [13] ggplot2_3.3.5 scRUtils_0.1.0
## [15] BiocStyle_2.22.0
##
## loaded via a namespace (and not attached):
## [1] ggnewscale_0.4.7 ggbeeswarm_0.6.0
## [3] colorspace_2.0-3 rjson_0.2.21
## [5] ellipsis_0.3.2 bluster_1.4.0
## [7] XVector_0.34.0 BiocNeighbors_1.12.0
## [9] farver_2.1.0 enrichR_3.0
## [11] ggrepel_0.9.1 bit64_4.0.5
## [13] AnnotationDbi_1.56.2 fansi_1.0.3
## [15] splines_4.1.3 sparseMatrixStats_1.6.0
## [17] cachem_1.0.6 geneplotter_1.72.0
## [19] knitr_1.38 polyclip_1.10-0
## [21] jsonlite_1.8.0 annotate_1.72.0
## [23] cluster_2.1.3 png_0.1-7
## [25] ggforce_0.3.3 BiocManager_1.30.16
## [27] compiler_4.1.3 httr_1.4.2
## [29] dqrng_0.3.0 assertthat_0.2.1
## [31] Matrix_1.4-1 fastmap_1.1.0
## [33] limma_3.50.1 cli_3.2.0
## [35] tweenr_1.0.2 BiocSingular_1.10.0
## [37] htmltools_0.5.2 tools_4.1.3
## [39] rsvd_1.0.5 igraph_1.3.0
## [41] gtable_0.3.0 glue_1.6.2
## [43] GenomeInfoDbData_1.2.7 dplyr_1.0.8
## [45] Rcpp_1.0.8.3 jquerylib_0.1.4
## [47] Biostrings_2.62.0 vctrs_0.4.1
## [49] DelayedMatrixStats_1.16.0 xfun_0.30
## [51] stringr_1.4.0 beachmat_2.10.0
## [53] lifecycle_1.0.1 irlba_2.3.5
## [55] statmod_1.4.36 XML_3.99-0.9
## [57] edgeR_3.36.0 zlibbioc_1.40.0
## [59] MASS_7.3-56 scales_1.2.0
## [61] parallel_4.1.3 RColorBrewer_1.1-3
## [63] curl_4.3.2 yaml_2.3.5
## [65] memoise_2.0.1 gridExtra_2.3
## [67] sass_0.4.1 stringi_1.7.6
## [69] RSQLite_2.2.10 highr_0.9
## [71] genefilter_1.76.0 ScaledMatrix_1.2.0
## [73] scran_1.22.1 BiocParallel_1.28.3
## [75] rlang_1.0.2 pkgconfig_2.0.3
## [77] bitops_1.0-7 evaluate_0.15
## [79] lattice_0.20-45 purrr_0.3.4
## [81] labeling_0.4.2 cowplot_1.1.1
## [83] bit_4.0.4 tidyselect_1.1.2
## [85] magrittr_2.0.3 bookdown_0.26
## [87] DESeq2_1.34.0 R6_2.5.1
## [89] magick_2.7.3 generics_0.1.2
## [91] metapod_1.2.0 DelayedArray_0.20.0
## [93] DBI_1.1.2 pillar_1.7.0
## [95] withr_2.5.0 survival_3.3-1
## [97] KEGGREST_1.34.0 RCurl_1.98-1.6
## [99] tibble_3.1.6 crayon_1.5.1
## [101] utf8_1.2.2 rmarkdown_2.13
## [103] viridis_0.6.2 locfit_1.5-9.5
## [105] grid_4.1.3 blob_1.2.3
## [107] digest_0.6.29 xtable_1.8-4
## [109] tidyr_1.2.0 munsell_0.5.0
## [111] beeswarm_0.4.0 viridisLite_0.4.0
## [113] vipor_0.4.5 bslib_0.3.1