Bioconductor AnnotationData Packages: http://www.bioconductor.org/packages/release/data/annotation/
AnnotationHub: https://bioconductor.org/packages/AnnotationHub/
License: GPL-3.0
Many organism-level (org) packages are readily available on Bioconductor. They provide mappings between a central identifier (e.g. Entrez Gene identifiers) and other identifiers (e.g. Ensembl IDs, RefSeq identifiers, GO identifiers, etc.).
The name of an org package is generally of the form org.<Sp>.<id>.db (e.g. org.Hs.eg.db), where <Sp> is usually a 2-letter abbreviation of the organism (e.g. Hs for Homo sapiens) and <id> is a lower-case abbreviation describing the type of central identifier (e.g. eg for gene identifiers assigned by Entrez Gene, or sgd for the Saccharomyces Genome Database). Most Bioconductor annotation packages are updated every 6 months.
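The naming convention can be checked with a regular expression. The following is a minimal base R sketch, not part of any Bioconductor package: the helper name parse_org_name and the exact pattern are assumptions made for illustration, and names that do not follow the common form simply yield NA.

```r
# Hypothetical helper: split an org package name into its parts.
# Assumes the common org.<Sp>.<id>.db form; names that deviate from it
# (e.g. "org.Mxanthus.db") yield NA for both fields.
parse_org_name <- function(pkg) {
  pattern <- "^org\\.([A-Za-z0-9]+)\\.([a-z]+)\\.db$"
  hits <- regmatches(pkg, regexec(pattern, pkg))
  parts <- t(vapply(hits, function(m) {
    # m holds the full match plus the two captured groups, or is empty
    if (length(m) == 3) m[2:3] else c(NA_character_, NA_character_)
  }, character(2)))
  data.frame(package = pkg, organism = parts[, 1], id_type = parts[, 2],
             stringsAsFactors = FALSE)
}

parsed <- parse_org_name(c("org.Hs.eg.db", "org.Sc.sgd.db", "org.Mxanthus.db"))
parsed
```

The first two names split into Hs/eg and Sc/sgd, while org.Mxanthus.db has no separate identifier field and is reported as NA.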
R
cd /ngs/GO-Enrichment-Analysis-Demo
R
BiocManager
List the organism-level packages available for installation through BiocManager.
BiocManager::available("^org\\.")
## [1] "org.Ag.eg.db" "org.At.tair.db" "org.Bt.eg.db" "org.Ce.eg.db"
## [5] "org.Cf.eg.db" "org.Dm.eg.db" "org.Dr.eg.db" "org.EcK12.eg.db"
## [9] "org.EcSakai.eg.db" "org.Gg.eg.db" "org.Hs.eg.db" "org.Mm.eg.db"
## [13] "org.Mmu.eg.db" "org.Mxanthus.db" "org.Pf.plasmo.db" "org.Pt.eg.db"
## [17] "org.Rn.eg.db" "org.Sc.sgd.db" "org.Ss.eg.db" "org.Xl.eg.db"
org package
As an example, let’s download and install the Arabidopsis thaliana (thale cress) package.
BiocManager::install("org.At.tair.db")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'org.At.tair.db'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'
library(org.At.tair.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
## clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
## parSapply, parSapplyLB
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int,
## pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which.max, which.min
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'. To
## cite Bioconductor, see 'citation("Biobase")', and for packages
## 'citation("pkgname")'.
##
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
##
## The following object is masked from 'package:base':
##
## expand.grid
org.At.tair.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ARABIDOPSIS_DB
## | ORGANISM: Arabidopsis thaliana
## | SPECIES: Arabidopsis
## | TAIRSOURCENAME: Tair
## | TAIRSOURCEDATE: 2020-Sep28
## | TAIRSOURCEURL: https://www.arabidopsis.org/
## | TAIRGOURL: https://www.arabidopsis.org/download_files/GO_and_PO_Annotations/Gene_Ontology_Annotations/ATH_GO_GOSLIM.txt
## | TAIRGENEURL: https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions
## | TAIRSYMBOLURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/gene_aliases_20190630.txt.gz
## | TAIRPATHURL: ftp://ftp.plantcyc.org/Pathways/Data_dumps/PMN14_January2020/pathways/Ara_pathways.20200125
## | TAIRPMIDURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/Locus_Published_20190630.txt.gz
## | TAIRCHRURL: https://www.arabidopsis.org/download_files/Maps/seqviewer_data/sv_gene.data
## | TAIRATHURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt
## | TAIRAGURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_AG_array_elements-2010-12-20.txt
## | CENTRALID: TAIR
## | TAXID: 3702
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2020-09-10
## | GOEGSOURCEDATE: 2020-Sep23
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | EGSOURCEDATE: 2020-Sep23
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
##
## Please see: help('select') for usage information
AnnotationHub
The method above returns only a limited number of organism-level annotation packages. Many more are available from Bioconductor’s AnnotationHub service.
To search for, download and install packages from the AnnotationHub service, first install AnnotationHub if it is not yet installed on your machine.
if (!requireNamespace("BiocManager", quietly = TRUE))
  install.packages("BiocManager")
BiocManager::install("AnnotationHub")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'AnnotationHub'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'
AnnotationHub object
library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
ah <- AnnotationHub()
## using temporary cache /tmp/RtmpYREXDV/BiocFileCache
## snapshotDate(): 2020-10-27
# URL for the online AnnotationHub
hubUrl(ah)
## [1] "https://annotationhub.bioconductor.org"
ah
## AnnotationHub with 55232 records
## # snapshotDate(): 2020-10-27
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/,...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Pan trogl...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile, TxDb, In...
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based,
## # maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl,
## # sourcetype
## # retrieve records with, e.g., 'object[["AH5012"]]'
##
## title
## AH5012 | Chromosome Band
## AH5013 | STS Markers
## AH5014 | FISH Clones
## AH5015 | Recomb Rate
## AH5016 | ENCODE Pilot
## ... ...
## AH89567 | Ensembl 103 EnsDb for Xiphophorus couchianus
## AH89568 | Ensembl 103 EnsDb for Xiphophorus maculatus
## AH89569 | Ensembl 103 EnsDb for Xenopus tropicalis
## AH89570 | Ensembl 103 EnsDb for Zonotrichia albicollis
## AH89571 | Ensembl 103 EnsDb for Zalophus californianus
# Number of resources
length(ah)
## [1] 55232
org records
Search for organism-level packages using the pattern-matching string “^org\\.”.
db <- query(ah, "^org\\.")
df <- mcols(db)
class(df)
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"
Show the query results stored in the DFrame.
# Column names
cbind(colnames(df))
## [,1]
## [1,] "title"
## [2,] "dataprovider"
## [3,] "species"
## [4,] "taxonomyid"
## [5,] "genome"
## [6,] "description"
## [7,] "coordinate_1_based"
## [8,] "maintainer"
## [9,] "rdatadateadded"
## [10,] "preparerclass"
## [11,] "tags"
## [12,] "rdataclass"
## [13,] "rdatapath"
## [14,] "sourceurl"
## [15,] "sourcetype"
# Number of org records
nrow(df)
## [1] 1695
# Show df
df[,c("title", "species")]
## DataFrame with 1695 rows and 2 columns
## title species
## <character> <character>
## AH84113 org.Ag.eg.db.sqlite Anopheles gambiae
## AH84114 org.At.tair.db.sqlite Arabidopsis thaliana
## AH84115 org.Bt.eg.db.sqlite Bos taurus
## AH84116 org.Cf.eg.db.sqlite Canis familiaris
## AH84117 org.Gg.eg.db.sqlite Gallus gallus
## ... ... ...
## AH87062 org.Schizosaccharomy.. Schizosaccharomyces ..
## AH87063 org.Burkholderia_ant.. Burkholderia anthina
## AH87064 org.Ascoidea_rubesce.. Ascoidea rubescens_D..
## AH87065 org.Burkholderia_pse.. Burkholderia pseudom..
## AH87066 org.Halogeometricum_.. Halogeometricum bori..
org package
Let’s search for and install the Felis catus (cat) package.
# Search df with keyword
data.frame(df[grep("Felis", df$species), c("title", "species", "rdatadateadded")])
## title species rdatadateadded
## AH85554 org.Felis_catus.eg.sqlite Felis catus 2020-10-27
## AH85555 org.Felis_domesticus.eg.sqlite Felis domesticus 2020-10-27
## AH85556 org.Felis_silvestris_catus.eg.sqlite Felis silvestris_catus 2020-10-27
## AH85835 org.Felis_canadensis.eg.sqlite Felis canadensis 2020-10-27
## AH86137 org.Felis_concolor.eg.sqlite Felis concolor 2020-10-27
# Retrieve the package for "Felis catus"
rn <- rownames(df[df$species == "Felis catus",])
org.Fc.eg.db <- ah[[rn]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
org.Fc.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
##
## Please see: help('select') for usage information
recordStatus(ah, rn)
## record status dateadded
## 1 AH85554 Public 2020-10-27
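The keyword search for "Felis" returned five records, so the record identifier is only unambiguous when the species is matched exactly. A minimal base R sketch of that lookup with a guard against zero or multiple matches follows; the helper find_record is hypothetical, and the small data.frame is a stand-in built from the output shown above rather than a live mcols(db) result.

```r
# Stand-in for the relevant columns of mcols(db), copied from the output above.
felis_hits <- data.frame(
  title = c("org.Felis_catus.eg.sqlite", "org.Felis_domesticus.eg.sqlite",
            "org.Felis_silvestris_catus.eg.sqlite",
            "org.Felis_canadensis.eg.sqlite", "org.Felis_concolor.eg.sqlite"),
  species = c("Felis catus", "Felis domesticus", "Felis silvestris_catus",
              "Felis canadensis", "Felis concolor"),
  row.names = c("AH85554", "AH85555", "AH85556", "AH85835", "AH86137"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the record ID for an exact species match,
# and stop if the match is missing or ambiguous.
find_record <- function(meta, species) {
  id <- rownames(meta)[which(meta$species == species)]
  if (length(id) != 1) {
    stop("expected exactly one record for '", species, "', found ", length(id))
  }
  id
}

record_id <- find_record(felis_hits, "Felis catus")
record_id
```

With the stand-in data this returns "AH85554", the same identifier used for ah[[rn]] above; passing only "Felis" stops with an error instead of silently selecting nothing.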
After an annotation package is retrieved, it is placed in the local AnnotationHub cache. You can use it again without having to download the package.
# Location of the local AnnotationHub cache
hubCache(ah)
## [1] "~/.cache/AnnotationHub"
# Load from cache
org.Fc.eg.db <- ah[[rn]]
## loading from cache
You can use the removeCache function to remove the local AnnotationHub database and all related resources.
removeCache(ah, ask = TRUE)
org db objects
columns
Shows which kinds of data can be returned for the AnnotationDb object.
Both objects (org.At.tair.db and org.Fc.eg.db) contain Gene Ontology mapping information.
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
columns(org.Fc.eg.db)
## [1] "ACCNUM" "ALIAS" "CHR" "ENSEMBL" "ENTREZID" "EVIDENCE"
## [7] "EVIDENCEALL" "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY"
## [13] "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL"
keytypes
Shows which columns can be used as keys.
keytypes(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
keytypes(org.Fc.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENTREZID" "EVIDENCE" "EVIDENCEALL"
## [7] "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL"
## [13] "PMID" "REFSEQ" "SYMBOL"
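The columns and keytypes listings are not always identical. As a small illustration in base R, the two vectors below are copied from the org.Fc.eg.db output shown above (the variable names are made up for this sketch), and setdiff reports which columns can be returned but not used as keys.

```r
# Values copied from the columns() and keytypes() output for org.Fc.eg.db.
fc_columns <- c("ACCNUM", "ALIAS", "CHR", "ENSEMBL", "ENTREZID", "EVIDENCE",
                "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")
fc_keytypes <- c("ACCNUM", "ALIAS", "ENSEMBL", "ENTREZID", "EVIDENCE",
                 "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                 "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")

# Columns that can be returned but are not available as a keytype
not_keys <- setdiff(fc_columns, fc_keytypes)
not_keys
```

For these values the only difference is "CHR". With the live objects, the same comparison would be setdiff(columns(org.Fc.eg.db), keytypes(org.Fc.eg.db)).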
keys
Returns the values (or keys) that can be expected for a given keytype. By default it returns the primary keys for the database.
head(keys(org.At.tair.db), 10) # Primary keys
## [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
## [8] "AT1G01073" "AT1G01080" "AT1G01090"
head(keys(org.At.tair.db, keytype = "SYMBOL"), 10)
## [1] "ANAC001" "NAC001" "NTL10" "ARV1" "NGA3" "ASU1" "ATDCL1" "CAF"
## [9] "DCL1" "EMB60"
head(keys(org.At.tair.db, keytype = "GO"), 10)
## [1] "GO:0003700" "GO:0005634" "GO:0006355" "GO:0003674" "GO:0005739" "GO:0005783"
## [7] "GO:0005794" "GO:0006665" "GO:0009507" "GO:0016125"
head(keys(org.Fc.eg.db), 10) # Primary keys
## [1] "414734" "445455" "448843" "492297" "492308" "493648" "493649" "493650" "493651"
## [10] "493652"
head(keys(org.Fc.eg.db, keytype = "SYMBOL"), 10)
## [1] "A1BG" "A1CF" "A2M" "A2ML1" "A3GALT2" "A4GALT" "A4GNT" "AAAS"
## [9] "AACS" "AADAC"
head(keys(org.Fc.eg.db, keytype = "GO"), 10)
## [1] "GO:0000002" "GO:0000003" "GO:0000010" "GO:0000012" "GO:0000014" "GO:0000015"
## [7] "GO:0000026" "GO:0000027" "GO:0000028" "GO:0000030"
select
Retrieves the data as a data.frame, based on the keys, columns and keytype arguments.
myKeys <- head(keys(org.At.tair.db, keytype = "TAIR"), 10)
myKeys
## [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
## [8] "AT1G01073" "AT1G01080" "AT1G01090"
select(org.At.tair.db, keys = myKeys, columns = "SYMBOL", keytype = "TAIR")
## 'select()' returned 1:many mapping between keys and columns
## TAIR SYMBOL
## 1 AT1G01010 ANAC001
## 2 AT1G01010 NAC001
## 3 AT1G01010 NTL10
## 4 AT1G01020 ARV1
## 5 AT1G01030 NGA3
## 6 AT1G01040 ASU1
## 7 AT1G01040 ATDCL1
## 8 AT1G01040 CAF
## 9 AT1G01040 DCL1
## 10 AT1G01040 EMB60
## 11 AT1G01040 EMB76
## 12 AT1G01040 SIN1
## 13 AT1G01040 SUS1
## 14 AT1G01050 AtPPa1
## 15 AT1G01050 PPa1
## 16 AT1G01060 LHY
## 17 AT1G01060 LHY1
## 18 AT1G01070 UMAMIT28
## 19 AT1G01073 <NA>
## 20 AT1G01080 <NA>
## 21 AT1G01090 PDH-E1
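Because select() reported a 1:many mapping, a key such as AT1G01010 occupies several rows. When one row per gene is needed, the symbols can be collapsed afterwards. The sketch below uses base R only and a few rows copied from the output above as a stand-in for the real result; the helper collapse_symbols and the ";" separator are choices made for this illustration.

```r
# A few rows copied from the select() output above (stand-in for the result).
res <- data.frame(
  TAIR = c("AT1G01010", "AT1G01010", "AT1G01010", "AT1G01020",
           "AT1G01060", "AT1G01060", "AT1G01073"),
  SYMBOL = c("ANAC001", "NAC001", "NTL10", "ARV1", "LHY", "LHY1", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: join all non-missing symbols of one gene,
# keeping NA when the gene has no symbol at all.
collapse_symbols <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) NA_character_ else paste(x, collapse = ";")
}

# One entry per TAIR identifier, named by that identifier
per_gene <- vapply(split(res$SYMBOL, res$TAIR), collapse_symbols, character(1))
per_gene
```

Here AT1G01010 collapses to "ANAC001;NAC001;NTL10", and AT1G01073, which has no symbol, stays NA.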
myKeys <- c("CCA1", "LHY", "PRR7", "PRR9") # morning loop components
select(org.At.tair.db, keys = myKeys, columns = "ENTREZID", keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENTREZID
## 1 CCA1 819296
## 2 LHY 839341
## 3 PRR7 831793
## 4 PRR9 819292
myKeys <- head(keys(org.Fc.eg.db, keytype = "ENSEMBL"), 10)
myKeys
## [1] "ENSFCAG00000000001" "ENSFCAG00000000007" "ENSFCAG00000000015" "ENSFCAG00000000022"
## [5] "ENSFCAG00000000023" "ENSFCAG00000000024" "ENSFCAG00000000028" "ENSFCAG00000000029"
## [9] "ENSFCAG00000000030" "ENSFCAG00000000031"
select(org.Fc.eg.db, keys = myKeys, columns = "SYMBOL", keytype = "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL
## 1 ENSFCAG00000000001 INTS6L
## 2 ENSFCAG00000000007 HMGCR
## 3 ENSFCAG00000000015 CEP192
## 4 ENSFCAG00000000022 RASGRP1
## 5 ENSFCAG00000000023 GPR39
## 6 ENSFCAG00000000024 LYPD1
## 7 ENSFCAG00000000028 RCN3
## 8 ENSFCAG00000000029 APOO
## 9 ENSFCAG00000000030 CXHXorf58
## 10 ENSFCAG00000000031 CB1H4orf19
myKeys <- c("ASIP", "MC1R") # coat color patterns
select(org.Fc.eg.db, keys = myKeys, columns = c("ENSEMBL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENSEMBL ENTREZID
## 1 ASIP ENSFCAG00000011037 492297
## 2 MC1R ENSFCAG00000003798 493917
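When only a single column is needed, AnnotationDbi also provides mapIds(), which returns one element per key and lets the caller decide how multiple matches are handled through its multiVals argument. The sketch below is a hedged illustration rather than part of the original walkthrough: it assumes org.At.tair.db is installed, skips itself otherwise, and the two TAIR identifiers are simply taken from the earlier select() example.

```r
# Runs only if the annotation package is installed; otherwise nothing happens.
if (requireNamespace("org.At.tair.db", quietly = TRUE)) {
  library(org.At.tair.db)
  ids <- c("AT1G01010", "AT1G01040")

  # One value per key: keep only the first matching symbol
  first_symbol <- mapIds(org.At.tair.db, keys = ids, column = "SYMBOL",
                         keytype = "TAIR", multiVals = "first")

  # All values per key, returned as a list named by key
  all_symbols <- mapIds(org.At.tair.db, keys = ids, column = "SYMBOL",
                        keytype = "TAIR", multiVals = "list")
}
```

select() remains the better fit when several columns are wanted at once, as in the ASIP and MC1R example above.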
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4/lib/libopenblasp-r0.3.12.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods
## [9] base
##
## other attached packages:
## [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 dbplyr_2.1.0
## [4] org.At.tair.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
## [7] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.0
## [10] knitr_1.31
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.21
## [3] bslib_0.2.4 BiocVersion_3.12.0
## [5] purrr_0.3.4 vctrs_0.3.6
## [7] generics_0.1.0 htmltools_0.5.1.1
## [9] yaml_2.2.1 utf8_1.1.4
## [11] interactiveDisplayBase_1.28.0 blob_1.2.1
## [13] rlang_0.4.10 later_1.1.0.1
## [15] jquerylib_0.1.3 pillar_1.5.0
## [17] withr_2.4.1 glue_1.4.2
## [19] DBI_1.1.1 rappdirs_0.3.3
## [21] bit64_4.0.5 lifecycle_1.0.0
## [23] stringr_1.4.0 memoise_2.0.0
## [25] evaluate_0.14 fastmap_1.1.0
## [27] httpuv_1.5.5 curl_4.3
## [29] fansi_0.4.2 Rcpp_1.0.6
## [31] xtable_1.8-4 promises_1.2.0.1
## [33] BiocManager_1.30.10 cachem_1.0.4
## [35] jsonlite_1.7.2 mime_0.10
## [37] bit_4.0.4 digest_0.6.27
## [39] stringi_1.5.3 shiny_1.6.0
## [41] dplyr_1.0.4 tools_4.0.3
## [43] magrittr_2.0.1 sass_0.3.1
## [45] RSQLite_2.2.3 tibble_3.1.0
## [47] crayon_1.4.1 pkgconfig_2.0.3
## [49] ellipsis_0.3.1 assertthat_0.2.1
## [51] rmarkdown_2.7 httr_1.4.2
## [53] R6_2.5.0 compiler_4.0.3