Bioconductor AnnotationData Packages: http://www.bioconductor.org/packages/release/data/annotation/
AnnotationHub: https://bioconductor.org/packages/AnnotationHub/
License: GPL-3.0
There are many organism-level (org) packages readily available on Bioconductor. They provide mappings between a central identifier (e.g. Entrez Gene identifiers) and other identifiers (e.g. Ensembl IDs, RefSeq identifiers, GO identifiers, etc.).
The name of an org package generally has the form org.<Sp>.<id>.db (e.g. org.Hs.eg.db), where <Sp> is usually a 2-letter abbreviation of the organism (e.g. Hs for Homo sapiens) and <id> is a lower-case abbreviation describing the type of central identifier (e.g. eg for gene identifiers assigned by Entrez Gene, or sgd for the Saccharomyces Genome Database). A few packages, such as org.EcK12.eg.db and org.Mxanthus.db, deviate from this pattern. Most of the Bioconductor annotation packages are updated every 6 months.
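The naming scheme can be checked programmatically. The following is a minimal sketch in base R; parse_org_name() is a hypothetical helper written for this illustration (it is not part of Bioconductor), and its regular expression assumes the standard three-part form, so names that deviate from it come back as NA.

```r
# Split organism-level package names into organism and identifier type.
# parse_org_name() is a hypothetical helper, not a Bioconductor function.
parse_org_name <- function(pkg) {
  # org.<Sp>.<id>.db: <Sp> is alphanumeric, <id> is lower-case letters
  pattern <- "^org\\.([A-Za-z0-9]+)\\.([a-z]+)\\.db$"
  ok <- grepl(pattern, pkg)
  data.frame(
    package  = pkg,
    organism = ifelse(ok, sub(pattern, "\\1", pkg), NA_character_),
    id_type  = ifelse(ok, sub(pattern, "\\2", pkg), NA_character_),
    stringsAsFactors = FALSE
  )
}

parsed <- parse_org_name(c("org.Hs.eg.db", "org.At.tair.db",
                           "org.Sc.sgd.db", "org.Mxanthus.db"))
print(parsed)
```

Returning NA instead of guessing keeps non-standard names visible, so they can be handled by hand.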
R
cd /ngs/GO-Enrichment-Analysis-Demo
R
BiocManager
List the organism-level packages available for installation with BiocManager.
BiocManager::available("^org\\.")
## [1] "org.Ag.eg.db" "org.At.tair.db" "org.Bt.eg.db" "org.Ce.eg.db"
## [5] "org.Cf.eg.db" "org.Dm.eg.db" "org.Dr.eg.db" "org.EcK12.eg.db"
## [9] "org.EcSakai.eg.db" "org.Gg.eg.db" "org.Hs.eg.db" "org.Mm.eg.db"
## [13] "org.Mmu.eg.db" "org.Mxanthus.db" "org.Pf.plasmo.db" "org.Pt.eg.db"
## [17] "org.Rn.eg.db" "org.Sc.sgd.db" "org.Ss.eg.db" "org.Xl.eg.db"
org package
As an example, let’s download and install the Arabidopsis thaliana (thale cress) package.
BiocManager::install("org.At.tair.db")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'org.At.tair.db'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'
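When a script is re-run, it can be convenient to check first which packages are actually missing. The sketch below uses only base R; missing_packages() is a hypothetical helper for this example, and the install call is left commented out so that nothing is downloaded by accident. Note that requireNamespace() loads a package's namespace as a side effect when the package is present.

```r
# Report which of the requested packages are not yet installed.
# missing_packages() is a hypothetical helper written for this example.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

wanted <- c("org.At.tair.db", "AnnotationHub")
to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "))
  # Uncomment to install; update = FALSE skips updating other packages.
  # BiocManager::install(to_install, update = FALSE)
}
```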
library(org.At.tair.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
##
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
## clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
## parSapply, parSapplyLB
##
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
##
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
## is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int,
## pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which.max, which.min
##
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with 'browseVignettes()'. To
## cite Bioconductor, see 'citation("Biobase")', and for packages
## 'citation("pkgname")'.
##
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
##
## The following object is masked from 'package:base':
##
## expand.grid
org.At.tair.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ARABIDOPSIS_DB
## | ORGANISM: Arabidopsis thaliana
## | SPECIES: Arabidopsis
## | TAIRSOURCENAME: Tair
## | TAIRSOURCEDATE: 2020-Sep28
## | TAIRSOURCEURL: https://www.arabidopsis.org/
## | TAIRGOURL: https://www.arabidopsis.org/download_files/GO_and_PO_Annotations/Gene_Ontology_Annotations/ATH_GO_GOSLIM.txt
## | TAIRGENEURL: https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions
## | TAIRSYMBOLURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/gene_aliases_20190630.txt.gz
## | TAIRPATHURL: ftp://ftp.plantcyc.org/Pathways/Data_dumps/PMN14_January2020/pathways/Ara_pathways.20200125
## | TAIRPMIDURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/Locus_Published_20190630.txt.gz
## | TAIRCHRURL: https://www.arabidopsis.org/download_files/Maps/seqviewer_data/sv_gene.data
## | TAIRATHURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt
## | TAIRAGURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_AG_array_elements-2010-12-20.txt
## | CENTRALID: TAIR
## | TAXID: 3702
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2020-09-10
## | GOEGSOURCEDATE: 2020-Sep23
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | EGSOURCEDATE: 2020-Sep23
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
##
## Please see: help('select') for usage information
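The fields printed above can also be read from the object itself, which is useful for recording the data source dates of an analysis. This is a small sketch that assumes org.At.tair.db is installed; if it is not, the guard leaves md as NULL. The three field names are taken from the printout above.

```r
# Assumption: org.At.tair.db is installed; otherwise `md` stays NULL.
md <- NULL
if (requireNamespace("org.At.tair.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.At.tair.db))
  # metadata() returns a data.frame with the columns 'name' and 'value'
  md <- metadata(org.At.tair.db)
  print(md[md$name %in% c("ORGANISM", "CENTRALID", "GOSOURCEDATE"), ])
}
```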
AnnotationHub
The method above returns only a limited number of organism-level annotation packages. Many more are available from Bioconductor’s AnnotationHub service.
To search, download and install packages from the AnnotationHub service, install AnnotationHub if it is not yet installed on your machine.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("AnnotationHub")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'AnnotationHub'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'
AnnotationHub object
library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
##
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
##
## cache
ah <- AnnotationHub()
## using temporary cache /tmp/RtmpYREXDV/BiocFileCache
## snapshotDate(): 2020-10-27
# URL for the online AnnotationHub
hubUrl(ah)
## [1] "https://annotationhub.bioconductor.org"
ah
## AnnotationHub with 55232 records
## # snapshotDate(): 2020-10-27
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/,...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Pan trogl...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile, TxDb, In...
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based,
## # maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl,
## # sourcetype
## # retrieve records with, e.g., 'object[["AH5012"]]'
##
## title
## AH5012 | Chromosome Band
## AH5013 | STS Markers
## AH5014 | FISH Clones
## AH5015 | Recomb Rate
## AH5016 | ENCODE Pilot
## ... ...
## AH89567 | Ensembl 103 EnsDb for Xiphophorus couchianus
## AH89568 | Ensembl 103 EnsDb for Xiphophorus maculatus
## AH89569 | Ensembl 103 EnsDb for Xenopus tropicalis
## AH89570 | Ensembl 103 EnsDb for Zonotrichia albicollis
## AH89571 | Ensembl 103 EnsDb for Zalophus californianus
# Number of resources
length(ah)
## [1] 55232
org records
Search for organism-level packages with the pattern-matching string “^org\\.”.
db <- query(ah, "^org\\.")
df <- mcols(db)
class(df)
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"
Show the query results stored in the DFrame.
# Column names
cbind(colnames(df))
## [,1]
## [1,] "title"
## [2,] "dataprovider"
## [3,] "species"
## [4,] "taxonomyid"
## [5,] "genome"
## [6,] "description"
## [7,] "coordinate_1_based"
## [8,] "maintainer"
## [9,] "rdatadateadded"
## [10,] "preparerclass"
## [11,] "tags"
## [12,] "rdataclass"
## [13,] "rdatapath"
## [14,] "sourceurl"
## [15,] "sourcetype"
# Number of org records
nrow(df)
## [1] 1695
# Show df
df[, c("title", "species")]
## DataFrame with 1695 rows and 2 columns
## title species
## <character> <character>
## AH84113 org.Ag.eg.db.sqlite Anopheles gambiae
## AH84114 org.At.tair.db.sqlite Arabidopsis thaliana
## AH84115 org.Bt.eg.db.sqlite Bos taurus
## AH84116 org.Cf.eg.db.sqlite Canis familiaris
## AH84117 org.Gg.eg.db.sqlite Gallus gallus
## ... ... ...
## AH87062 org.Schizosaccharomy.. Schizosaccharomyces ..
## AH87063 org.Burkholderia_ant.. Burkholderia anthina
## AH87064 org.Ascoidea_rubesce.. Ascoidea rubescens_D..
## AH87065 org.Burkholderia_pse.. Burkholderia pseudom..
## AH87066 org.Halogeometricum_.. Halogeometricum bori..
org package
Let’s search for and install the Felis catus (cat) package.
# Search df with keyword
data.frame(df[grep("Felis", df$species), c("title", "species", "rdatadateadded")])
## title species rdatadateadded
## AH85554 org.Felis_catus.eg.sqlite Felis catus 2020-10-27
## AH85555 org.Felis_domesticus.eg.sqlite Felis domesticus 2020-10-27
## AH85556 org.Felis_silvestris_catus.eg.sqlite Felis silvestris_catus 2020-10-27
## AH85835 org.Felis_canadensis.eg.sqlite Felis canadensis 2020-10-27
## AH86137 org.Felis_concolor.eg.sqlite Felis concolor 2020-10-27
# Retrieve the package for "Felis catus"
rn <- rownames(df[df$species == "Felis catus", ])
org.Fc.eg.db <- ah[[rn]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
org.Fc.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
##
## Please see: help('select') for usage information
recordStatus(ah, rn)
## record status dateadded
## 1 AH85554 Public 2020-10-27
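An exact species match may return no record or several, and retrieval with [[ is meant for a single resource, so it can help to check the match first. The sketch below works on a plain data.frame that copies three rows of the search result shown above; pick_single_record() is a hypothetical helper, and the assumption is that the same indexing works on the DFrame returned by mcols().

```r
# Toy copy of part of the search result (record IDs as row names).
hits <- data.frame(
  title   = c("org.Felis_catus.eg.sqlite",
              "org.Felis_domesticus.eg.sqlite",
              "org.Felis_silvestris_catus.eg.sqlite"),
  species = c("Felis catus", "Felis domesticus", "Felis silvestris_catus"),
  row.names = c("AH85554", "AH85555", "AH85556"),
  stringsAsFactors = FALSE
)

# Return the record ID for a species, failing loudly unless exactly one matches.
# pick_single_record() is a hypothetical helper written for this example.
pick_single_record <- function(df, species) {
  rn <- rownames(df)[df$species == species]
  if (length(rn) != 1) {
    stop("expected exactly one record for '", species,
         "', found ", length(rn))
  }
  rn
}

rn <- pick_single_record(hits, "Felis catus")
rn
```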
After an annotation package has been retrieved, it is stored in the local AnnotationHub cache. You can use it again without having to download it a second time.
# Location of the local AnnotationHub cache
hubCache(ah)
## [1] "~/.cache/AnnotationHub"
# Load from cache
org.Fc.eg.db <- ah[[rn]]
## loading from cache
You can use the removeCache function to remove the local AnnotationHub database and all related resources.
removeCache(ah, ask = TRUE)
org db objects
columns
Shows which kinds of data can be returned for the AnnotationDb object.
Both objects contain Gene Ontology mapping information.
columns(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
columns(org.Fc.eg.db)
## [1] "ACCNUM" "ALIAS" "CHR" "ENSEMBL" "ENTREZID" "EVIDENCE"
## [7] "EVIDENCEALL" "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY"
## [13] "ONTOLOGYALL" "PMID" "REFSEQ" "SYMBOL"
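To see at a glance what the two databases have in common, the column names can be compared with set operations. The sketch below hard-codes the two vectors printed above so that it runs without the packages; with the objects loaded, columns(org.At.tair.db) and columns(org.Fc.eg.db) could be used instead.

```r
# Column names copied from the columns() output above.
at_columns <- c("ARACYC", "ARACYCENZYME", "ENTREZID", "ENZYME", "EVIDENCE",
                "EVIDENCEALL", "GENENAME", "GO", "GOALL", "ONTOLOGY",
                "ONTOLOGYALL", "PATH", "PMID", "REFSEQ", "SYMBOL", "TAIR")
fc_columns <- c("ACCNUM", "ALIAS", "CHR", "ENSEMBL", "ENTREZID", "EVIDENCE",
                "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")

shared  <- intersect(at_columns, fc_columns)  # available in both databases
only_at <- setdiff(at_columns, fc_columns)    # Arabidopsis-specific columns
only_fc <- setdiff(fc_columns, at_columns)    # cat-specific columns

# Both databases carry the Gene Ontology columns.
all(c("GO", "GOALL", "ONTOLOGY") %in% shared)
```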
keytypes
Shows which columns can be used as keys.
keytypes(org.At.tair.db)
## [1] "ARACYC" "ARACYCENZYME" "ENTREZID" "ENZYME" "EVIDENCE"
## [6] "EVIDENCEALL" "GENENAME" "GO" "GOALL" "ONTOLOGY"
## [11] "ONTOLOGYALL" "PATH" "PMID" "REFSEQ" "SYMBOL"
## [16] "TAIR"
keytypes(org.Fc.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENTREZID" "EVIDENCE" "EVIDENCEALL"
## [7] "GENENAME" "GID" "GO" "GOALL" "ONTOLOGY" "ONTOLOGYALL"
## [13] "PMID" "REFSEQ" "SYMBOL"
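Not every column is necessarily a valid keytype. As a sketch, the vectors below copy the columns() and keytypes() output for the cat database shown above and list the columns that cannot be used as keys; check_keytype() is a hypothetical helper that stops early with a readable message.

```r
# Values copied from the columns() and keytypes() output for org.Fc.eg.db.
fc_columns  <- c("ACCNUM", "ALIAS", "CHR", "ENSEMBL", "ENTREZID", "EVIDENCE",
                 "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                 "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")
fc_keytypes <- c("ACCNUM", "ALIAS", "ENSEMBL", "ENTREZID", "EVIDENCE",
                 "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
                 "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")

# Columns that can be returned but not used as a keytype.
not_keys <- setdiff(fc_columns, fc_keytypes)
not_keys

# check_keytype() is a hypothetical helper written for this example.
check_keytype <- function(keytype, available) {
  if (!keytype %in% available) {
    stop("'", keytype, "' is not a valid keytype; choose one of: ",
         paste(available, collapse = ", "))
  }
  invisible(TRUE)
}

check_keytype("SYMBOL", fc_keytypes)
```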
keys
Returns values (or keys) that can be expected for a given keytype. By default it will return the primary keys for the database.
head(keys(org.At.tair.db), 10) # Primary keys
## [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
## [8] "AT1G01073" "AT1G01080" "AT1G01090"
head(keys(org.At.tair.db, keytype = "SYMBOL"), 10)
## [1] "ANAC001" "NAC001" "NTL10" "ARV1" "NGA3" "ASU1" "ATDCL1" "CAF"
## [9] "DCL1" "EMB60"
head(keys(org.At.tair.db, keytype = "GO"), 10)
## [1] "GO:0003700" "GO:0005634" "GO:0006355" "GO:0003674" "GO:0005739" "GO:0005783"
## [7] "GO:0005794" "GO:0006665" "GO:0009507" "GO:0016125"
head(keys(org.Fc.eg.db), 10) # Primary keys
## [1] "414734" "445455" "448843" "492297" "492308" "493648" "493649" "493650" "493651"
## [10] "493652"
head(keys(org.Fc.eg.db, keytype = "SYMBOL"), 10)
## [1] "A1BG" "A1CF" "A2M" "A2ML1" "A3GALT2" "A4GALT" "A4GNT" "AAAS"
## [9] "AACS" "AADAC"
head(keys(org.Fc.eg.db, keytype = "GO"), 10)
## [1] "GO:0000002" "GO:0000003" "GO:0000010" "GO:0000012" "GO:0000014" "GO:0000015"
## [7] "GO:0000026" "GO:0000027" "GO:0000028" "GO:0000030"
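keys() can also restrict its result with a regular expression through its pattern argument, which avoids listing every key. This sketch assumes org.At.tair.db is installed (the guard leaves an empty vector otherwise); the "^PRR" pattern is only an illustrative choice.

```r
# Assumption: org.At.tair.db is installed; otherwise the result stays empty.
prr_symbols <- character(0)
if (requireNamespace("org.At.tair.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.At.tair.db))
  # Only gene symbols that start with "PRR"
  prr_symbols <- keys(org.At.tair.db, keytype = "SYMBOL", pattern = "^PRR")
}
prr_symbols
```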
select
Retrieve the data as a data.frame based on the keys, columns and keytype arguments.
myKeys <- head(keys(org.At.tair.db, keytype = "TAIR"), 10)
myKeys
## [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
## [8] "AT1G01073" "AT1G01080" "AT1G01090"
select(org.At.tair.db, keys = myKeys, columns = "SYMBOL", keytype = "TAIR")
## 'select()' returned 1:many mapping between keys and columns
## TAIR SYMBOL
## 1 AT1G01010 ANAC001
## 2 AT1G01010 NAC001
## 3 AT1G01010 NTL10
## 4 AT1G01020 ARV1
## 5 AT1G01030 NGA3
## 6 AT1G01040 ASU1
## 7 AT1G01040 ATDCL1
## 8 AT1G01040 CAF
## 9 AT1G01040 DCL1
## 10 AT1G01040 EMB60
## 11 AT1G01040 EMB76
## 12 AT1G01040 SIN1
## 13 AT1G01040 SUS1
## 14 AT1G01050 AtPPa1
## 15 AT1G01050 PPa1
## 16 AT1G01060 LHY
## 17 AT1G01060 LHY1
## 18 AT1G01070 UMAMIT28
## 19 AT1G01073 <NA>
## 20 AT1G01080 <NA>
## 21 AT1G01090 PDH-E1
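A 1:many result like the one above repeats each key once per match. If one row per key is needed, the matches can be collapsed afterwards. The sketch below uses base R on a toy data.frame that copies a few rows of the output above; the ";" separator is an arbitrary choice.

```r
# Toy copy of a few rows from the select() result above.
res <- data.frame(
  TAIR   = c("AT1G01010", "AT1G01010", "AT1G01010",
             "AT1G01020", "AT1G01030", "AT1G01073"),
  SYMBOL = c("ANAC001", "NAC001", "NTL10", "ARV1", "NGA3", NA),
  stringsAsFactors = FALSE
)

# One entry per TAIR identifier; keys without any symbol stay NA.
collapsed <- vapply(split(res$SYMBOL, res$TAIR), function(s) {
  s <- s[!is.na(s)]
  if (length(s) == 0) NA_character_ else paste(s, collapse = ";")
}, character(1))

data.frame(TAIR = names(collapsed), SYMBOL = unname(collapsed),
           stringsAsFactors = FALSE)
```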
myKeys <- c("CCA1", "LHY", "PRR7", "PRR9") # morning loop components
select(org.At.tair.db, keys = myKeys, columns = "ENTREZID", keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENTREZID
## 1 CCA1 819296
## 2 LHY 839341
## 3 PRR7 831793
## 4 PRR9 819292
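When only a single column is needed, AnnotationDbi's mapIds() is an alternative to select() that returns a named vector with one value per key. The sketch below assumes org.At.tair.db is installed (the guard leaves ids as NULL otherwise); multiVals = "first" keeps the first match when a key maps to several values.

```r
# Assumption: org.At.tair.db is installed; otherwise `ids` stays NULL.
ids <- NULL
if (requireNamespace("org.At.tair.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.At.tair.db))
  # Named character vector: names are the symbols, values the Entrez IDs
  ids <- mapIds(org.At.tair.db,
                keys = c("CCA1", "LHY", "PRR7", "PRR9"),
                column = "ENTREZID", keytype = "SYMBOL",
                multiVals = "first")
}
ids
```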
myKeys <- head(keys(org.Fc.eg.db, keytype = "ENSEMBL"), 10)
myKeys
## [1] "ENSFCAG00000000001" "ENSFCAG00000000007" "ENSFCAG00000000015" "ENSFCAG00000000022"
## [5] "ENSFCAG00000000023" "ENSFCAG00000000024" "ENSFCAG00000000028" "ENSFCAG00000000029"
## [9] "ENSFCAG00000000030" "ENSFCAG00000000031"
select(org.Fc.eg.db, keys = myKeys, columns = "SYMBOL", keytype = "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
## ENSEMBL SYMBOL
## 1 ENSFCAG00000000001 INTS6L
## 2 ENSFCAG00000000007 HMGCR
## 3 ENSFCAG00000000015 CEP192
## 4 ENSFCAG00000000022 RASGRP1
## 5 ENSFCAG00000000023 GPR39
## 6 ENSFCAG00000000024 LYPD1
## 7 ENSFCAG00000000028 RCN3
## 8 ENSFCAG00000000029 APOO
## 9 ENSFCAG00000000030 CXHXorf58
## 10 ENSFCAG00000000031 CB1H4orf19
myKeys <- c("ASIP", "MC1R") # coat color patterns
select(org.Fc.eg.db, keys = myKeys, columns = c("ENSEMBL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
## SYMBOL ENSEMBL ENTREZID
## 1 ASIP ENSFCAG00000011037 492297
## 2 MC1R ENSFCAG00000003798 493917
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4/lib/libopenblasp-r0.3.12.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods
## [9] base
##
## other attached packages:
## [1] AnnotationHub_2.22.0 BiocFileCache_1.14.0 dbplyr_2.1.0
## [4] org.At.tair.db_3.12.0 AnnotationDbi_1.52.0 IRanges_2.24.1
## [7] S4Vectors_0.28.1 Biobase_2.50.0 BiocGenerics_0.36.0
## [10] knitr_1.31
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.1.0 xfun_0.21
## [3] bslib_0.2.4 BiocVersion_3.12.0
## [5] purrr_0.3.4 vctrs_0.3.6
## [7] generics_0.1.0 htmltools_0.5.1.1
## [9] yaml_2.2.1 utf8_1.1.4
## [11] interactiveDisplayBase_1.28.0 blob_1.2.1
## [13] rlang_0.4.10 later_1.1.0.1
## [15] jquerylib_0.1.3 pillar_1.5.0
## [17] withr_2.4.1 glue_1.4.2
## [19] DBI_1.1.1 rappdirs_0.3.3
## [21] bit64_4.0.5 lifecycle_1.0.0
## [23] stringr_1.4.0 memoise_2.0.0
## [25] evaluate_0.14 fastmap_1.1.0
## [27] httpuv_1.5.5 curl_4.3
## [29] fansi_0.4.2 Rcpp_1.0.6
## [31] xtable_1.8-4 promises_1.2.0.1
## [33] BiocManager_1.30.10 cachem_1.0.4
## [35] jsonlite_1.7.2 mime_0.10
## [37] bit_4.0.4 digest_0.6.27
## [39] stringi_1.5.3 shiny_1.6.0
## [41] dplyr_1.0.4 tools_4.0.3
## [43] magrittr_2.0.1 sass_0.3.1
## [45] RSQLite_2.2.3 tibble_3.1.0
## [47] crayon_1.4.1 pkgconfig_2.0.3
## [49] ellipsis_0.3.1 assertthat_0.2.1
## [51] rmarkdown_2.7 httr_1.4.2
## [53] R6_2.5.0 compiler_4.0.3