Bioconductor AnnotationData Packages: http://www.bioconductor.org/packages/release/data/annotation/

AnnotationHub: https://bioconductor.org/packages/AnnotationHub/

License: GPL-3.0

Introduction

There are many organism-level (org) packages readily available on Bioconductor. They provide mappings between a central identifier (e.g. Entrez Gene identifiers) and other identifiers (e.g. Ensembl IDs, RefSeq identifiers, GO identifiers, etc.).

The name of an org package usually has the form org.<Sp>.<id>.db (e.g. org.Hs.eg.db), where <Sp> is a short abbreviation of the organism, typically two letters (e.g. Hs for Homo sapiens), and <id> is a lower-case abbreviation describing the type of central identifier (e.g. eg for gene identifiers assigned by Entrez Gene, or sgd for the Saccharomyces Genome Database). Most of the Bioconductor annotation packages are updated every 6 months.
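The naming convention can be checked mechanically. The following is a minimal sketch in base R that splits package names into their parts; parse_org_name is a hypothetical helper written for this illustration and is not part of any Bioconductor package. Names with only three dot-separated parts (such as org.Mxanthus.db, which BiocManager also lists) have no central-identifier part, so that field is reported as NA.

```r
# Hypothetical helper: split org package names into their components.
parse_org_name <- function(pkg) {
  parts <- strsplit(pkg, ".", fixed = TRUE)
  data.frame(
    package  = pkg,
    organism = vapply(parts, function(p) p[2], character(1)),
    # The central-ID part is present only in four-part names.
    central_id = vapply(
      parts,
      function(p) if (length(p) == 4) p[3] else NA_character_,
      character(1)
    ),
    stringsAsFactors = FALSE
  )
}

pkgs <- c("org.Hs.eg.db", "org.At.tair.db", "org.Sc.sgd.db", "org.Mxanthus.db")
parsed <- parse_org_name(pkgs)
print(parsed)
```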

Start R

cd /ngs/GO-Enrichment-Analysis-Demo

R

Using BiocManager

List available organism-level packages for installation in BiocManager.

BiocManager::available("^org\\.")
##  [1] "org.Ag.eg.db"      "org.At.tair.db"    "org.Bt.eg.db"      "org.Ce.eg.db"     
##  [5] "org.Cf.eg.db"      "org.Dm.eg.db"      "org.Dr.eg.db"      "org.EcK12.eg.db"  
##  [9] "org.EcSakai.eg.db" "org.Gg.eg.db"      "org.Hs.eg.db"      "org.Mm.eg.db"     
## [13] "org.Mmu.eg.db"     "org.Mxanthus.db"   "org.Pf.plasmo.db"  "org.Pt.eg.db"     
## [17] "org.Rn.eg.db"      "org.Sc.sgd.db"     "org.Ss.eg.db"      "org.Xl.eg.db"
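
Before installing anything, it can be useful to see which org packages are already present in the local library. A small sketch using only base R follows; the wanted vector is an arbitrary selection chosen for illustration.

```r
# Names of all packages installed in the current library paths
installed <- rownames(installed.packages())

# Keep only organism-level annotation packages
org_installed <- grep("^org\\..*\\.db$", installed, value = TRUE)

# Packages we would like to have (arbitrary example selection)
wanted <- c("org.At.tair.db", "org.Hs.eg.db")
missing_pkgs <- setdiff(wanted, org_installed)

# missing_pkgs could then be passed to BiocManager::install()
print(missing_pkgs)
```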

Install Arabidopsis org package

As an example, let’s download and install the Arabidopsis thaliana (thale cress) package.

BiocManager::install("org.At.tair.db")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'org.At.tair.db'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'
library(org.At.tair.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: BiocGenerics
## Loading required package: parallel
## 
## Attaching package: 'BiocGenerics'
## 
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ, clusterExport,
##     clusterMap, parApply, parCapply, parLapply, parLapplyLB, parRapply,
##     parSapply, parSapplyLB
## 
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## 
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, basename, cbind, colnames, dirname,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep, grepl, intersect,
##     is.unsorted, lapply, Map, mapply, match, mget, order, paste, pmax, pmax.int,
##     pmin, pmin.int, Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which.max, which.min
## 
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with 'browseVignettes()'. To
##     cite Bioconductor, see 'citation("Biobase")', and for packages
##     'citation("pkgname")'.
## 
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## 
## The following object is masked from 'package:base':
## 
##     expand.grid
org.At.tair.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## | DBSCHEMA: ARABIDOPSIS_DB
## | ORGANISM: Arabidopsis thaliana
## | SPECIES: Arabidopsis
## | TAIRSOURCENAME: Tair
## | TAIRSOURCEDATE: 2020-Sep28
## | TAIRSOURCEURL: https://www.arabidopsis.org/
## | TAIRGOURL: https://www.arabidopsis.org/download_files/GO_and_PO_Annotations/Gene_Ontology_Annotations/ATH_GO_GOSLIM.txt
## | TAIRGENEURL: https://www.arabidopsis.org/download_files/Genes/TAIR10_genome_release/TAIR10_functional_descriptions
## | TAIRSYMBOLURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/gene_aliases_20190630.txt.gz
## | TAIRPATHURL: ftp://ftp.plantcyc.org/Pathways/Data_dumps/PMN14_January2020/pathways/Ara_pathways.20200125
## | TAIRPMIDURL: https://www.arabidopsis.org/download_files/Public_Data_Releases/TAIR_Data_20190630/Locus_Published_20190630.txt.gz
## | TAIRCHRURL: https://www.arabidopsis.org/download_files/Maps/seqviewer_data/sv_gene.data
## | TAIRATHURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_ATH1_array_elements-2010-12-20.txt
## | TAIRAGURL: https://www.arabidopsis.org/download_files/Microarrays/Affymetrix/affy_AG_array_elements-2010-12-20.txt
## | CENTRALID: TAIR
## | TAXID: 3702
## | KEGGSOURCENAME: KEGG GENOME
## | KEGGSOURCEURL: ftp://ftp.genome.jp/pub/kegg/genomes
## | KEGGSOURCEDATE: 2011-Mar15
## | GOSOURCENAME: Gene Ontology
## | GOSOURCEURL: http://current.geneontology.org/ontology/go-basic.obo
## | GOSOURCEDATE: 2020-09-10
## | GOEGSOURCEDATE: 2020-Sep23
## | GOEGSOURCENAME: Entrez Gene
## | GOEGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## | EGSOURCEDATE: 2020-Sep23
## | EGSOURCENAME: Entrez Gene
## | EGSOURCEURL: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA
## 
## Please see: help('select') for usage information
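
The fields printed above are also available programmatically. As a hedged sketch: metadata() applied to an OrgDb object returns a data.frame with name and value columns, which can be searched like any other data frame. The helper get_meta is hypothetical, and the three example rows are copied from the printed summary above so that the snippet runs even without the package installed.

```r
# Hypothetical helper: look up metadata fields by name.
get_meta <- function(md, field) {
  md$value[match(field, md$name)]
}

# Rows copied from the printed org.At.tair.db summary
md_example <- data.frame(
  name  = c("CENTRALID", "TAXID", "GOSOURCEDATE"),
  value = c("TAIR", "3702", "2020-09-10"),
  stringsAsFactors = FALSE
)
go_date <- get_meta(md_example, "GOSOURCEDATE")
print(go_date)

# With the real package installed, the same helper works on metadata()
if (requireNamespace("org.At.tair.db", quietly = TRUE)) {
  suppressPackageStartupMessages(library(org.At.tair.db))
  print(get_meta(metadata(org.At.tair.db), c("CENTRALID", "GOSOURCEDATE")))
}
```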

Using AnnotationHub

The method above returns only a limited number of organism-level annotation packages. Many more are available from Bioconductor’s AnnotationHub service.

To search for and download packages from the AnnotationHub service, first install AnnotationHub if it is not yet installed on your machine.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("AnnotationHub")
## Bioconductor version 3.12 (BiocManager 1.30.10), R 4.0.3 (2020-10-10)
## Installing package(s) 'AnnotationHub'
## Updating HTML index of packages in '.Library'
## Making 'packages.html' ... done
## Old packages: 'dplyr', 'ggforce', 'topGO'

Create an AnnotationHub object

library(AnnotationHub)
## Loading required package: BiocFileCache
## Loading required package: dbplyr
## 
## Attaching package: 'AnnotationHub'
## The following object is masked from 'package:Biobase':
## 
##     cache
ah <- AnnotationHub()
## using temporary cache /tmp/RtmpYREXDV/BiocFileCache
## snapshotDate(): 2020-10-27
# URL for the online AnnotationHub
hubUrl(ah)
## [1] "https://annotationhub.bioconductor.org"

Summary of available records

ah
## AnnotationHub with 55232 records
## # snapshotDate(): 2020-10-27
## # $dataprovider: Ensembl, BroadInstitute, UCSC, ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/,...
## # $species: Homo sapiens, Mus musculus, Drosophila melanogaster, Bos taurus, Pan trogl...
## # $rdataclass: GRanges, TwoBitFile, BigWigFile, EnsDb, Rle, OrgDb, ChainFile, TxDb, In...
## # additional mcols(): taxonomyid, genome, description, coordinate_1_based,
## #   maintainer, rdatadateadded, preparerclass, tags, rdatapath, sourceurl,
## #   sourcetype 
## # retrieve records with, e.g., 'object[["AH5012"]]' 
## 
##             title                                       
##   AH5012  | Chromosome Band                             
##   AH5013  | STS Markers                                 
##   AH5014  | FISH Clones                                 
##   AH5015  | Recomb Rate                                 
##   AH5016  | ENCODE Pilot                                
##   ...       ...                                         
##   AH89567 | Ensembl 103 EnsDb for Xiphophorus couchianus
##   AH89568 | Ensembl 103 EnsDb for Xiphophorus maculatus 
##   AH89569 | Ensembl 103 EnsDb for Xenopus tropicalis    
##   AH89570 | Ensembl 103 EnsDb for Zonotrichia albicollis
##   AH89571 | Ensembl 103 EnsDb for Zalophus californianus
# Number of resources
length(ah)
## [1] 55232

Query the hub for org records

Search for organism-level packages with a pattern-matching string “^org\\.”.

db <- query(ah, "^org\\.")
df <- mcols(db)
class(df)
## [1] "DFrame"
## attr(,"package")
## [1] "S4Vectors"

Show query results

Inspect the query results stored in the DFrame.

# Column names
cbind(colnames(df))
##       [,1]                
##  [1,] "title"             
##  [2,] "dataprovider"      
##  [3,] "species"           
##  [4,] "taxonomyid"        
##  [5,] "genome"            
##  [6,] "description"       
##  [7,] "coordinate_1_based"
##  [8,] "maintainer"        
##  [9,] "rdatadateadded"    
## [10,] "preparerclass"     
## [11,] "tags"              
## [12,] "rdataclass"        
## [13,] "rdatapath"         
## [14,] "sourceurl"         
## [15,] "sourcetype"
# Number of org records
nrow(df)
## [1] 1695
# Show df
df[,c("title", "species")]
## DataFrame with 1695 rows and 2 columns
##                          title                species
##                    <character>            <character>
## AH84113    org.Ag.eg.db.sqlite      Anopheles gambiae
## AH84114  org.At.tair.db.sqlite   Arabidopsis thaliana
## AH84115    org.Bt.eg.db.sqlite             Bos taurus
## AH84116    org.Cf.eg.db.sqlite       Canis familiaris
## AH84117    org.Gg.eg.db.sqlite          Gallus gallus
## ...                        ...                    ...
## AH87062 org.Schizosaccharomy.. Schizosaccharomyces ..
## AH87063 org.Burkholderia_ant..   Burkholderia anthina
## AH87064 org.Ascoidea_rubesce.. Ascoidea rubescens_D..
## AH87065 org.Burkholderia_pse.. Burkholderia pseudom..
## AH87066 org.Halogeometricum_.. Halogeometricum bori..

Download Felis org package

Let’s search for and download the Felis catus (cat) package.

# Search df with keyword
data.frame(df[grep("Felis", df$species), c("title", "species", "rdatadateadded")])
##                                        title                species rdatadateadded
## AH85554            org.Felis_catus.eg.sqlite            Felis catus     2020-10-27
## AH85555       org.Felis_domesticus.eg.sqlite       Felis domesticus     2020-10-27
## AH85556 org.Felis_silvestris_catus.eg.sqlite Felis silvestris_catus     2020-10-27
## AH85835       org.Felis_canadensis.eg.sqlite       Felis canadensis     2020-10-27
## AH86137         org.Felis_concolor.eg.sqlite         Felis concolor     2020-10-27
# Retrieve the package for "Felis catus"
rn <- rownames(df[df$species == "Felis catus",])
org.Fc.eg.db <- ah[[rn]]
## downloading 1 resources
## retrieving 1 resource
## loading from cache
org.Fc.eg.db
## OrgDb object:
## | DBSCHEMAVERSION: 2.1
## | DBSCHEMA: NOSCHEMA_DB
## | ORGANISM: Felis catus
## | SPECIES: Felis catus
## | CENTRALID: GID
## | Taxonomy ID: 9685
## | Db type: OrgDb
## | Supporting package: AnnotationDbi
## 
## Please see: help('select') for usage information
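
The lookup by species above relies on exactly one record matching. A defensive variant, sketched below with a hypothetical helper find_record_id, fails early when zero or several records match. The demonstration uses the record IDs and species names from the Felis search results above rather than a live hub, so it runs without a network connection.

```r
# Hypothetical helper: return the single record ID whose species matches
# the target exactly, or stop with an informative error.
find_record_id <- function(ids, species, target) {
  hit <- ids[species == target]
  if (length(hit) != 1L) {
    stop("expected exactly one record for '", target,
         "', found ", length(hit))
  }
  hit
}

# IDs and species copied from the search results above
ids <- c("AH85554", "AH85555", "AH85556", "AH85835", "AH86137")
species <- c("Felis catus", "Felis domesticus", "Felis silvestris_catus",
             "Felis canadensis", "Felis concolor")

rn_checked <- find_record_id(ids, species, "Felis catus")
print(rn_checked)
```

With the hub objects from this tutorial, the equivalent call would be find_record_id(rownames(df), df$species, "Felis catus"), and the result can be passed to ah[[...]].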

Show record status

recordStatus(ah, rn)
##    record status  dateadded
## 1 AH85554 Public 2020-10-27

Load from local cache

After an annotation package is retrieved, it is placed in the local AnnotationHub cache. You can use it again without having to download it.

# Location of the local AnnotationHub cache
hubCache(ah)
## [1] "~/.cache/AnnotationHub"
# Load from cache
org.Fc.eg.db <- ah[[rn]]
## loading from cache

Clear local cache

You can use the removeCache function to remove the local AnnotationHub database and all related resources.

removeCache(ah, ask = TRUE)

Discover org db objects

columns

Shows which kinds of data can be returned for the AnnotationDb object.

Both objects contain Gene Ontology mapping information.

columns(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"
columns(org.Fc.eg.db)
##  [1] "ACCNUM"      "ALIAS"       "CHR"         "ENSEMBL"     "ENTREZID"    "EVIDENCE"   
##  [7] "EVIDENCEALL" "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"   
## [13] "ONTOLOGYALL" "PMID"        "REFSEQ"      "SYMBOL"

keytypes

Shows which columns can be used as keys.

keytypes(org.At.tair.db)
##  [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
##  [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
## [11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
## [16] "TAIR"
keytypes(org.Fc.eg.db)
##  [1] "ACCNUM"      "ALIAS"       "ENSEMBL"     "ENTREZID"    "EVIDENCE"    "EVIDENCEALL"
##  [7] "GENENAME"    "GID"         "GO"          "GOALL"       "ONTOLOGY"    "ONTOLOGYALL"
## [13] "PMID"        "REFSEQ"      "SYMBOL"
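
Columns and keytypes largely overlap, but they need not be identical. Comparing the two vectors with setdiff() shows which columns can be returned but not used as keys. The vectors below are copied from the org.Fc.eg.db output above so that the sketch runs on its own; with the object loaded, columns(org.Fc.eg.db) and keytypes(org.Fc.eg.db) could be used directly.

```r
# Copied from the columns(org.Fc.eg.db) output
cols <- c("ACCNUM", "ALIAS", "CHR", "ENSEMBL", "ENTREZID", "EVIDENCE",
          "EVIDENCEALL", "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY",
          "ONTOLOGYALL", "PMID", "REFSEQ", "SYMBOL")

# Copied from the keytypes(org.Fc.eg.db) output
kts <- c("ACCNUM", "ALIAS", "ENSEMBL", "ENTREZID", "EVIDENCE", "EVIDENCEALL",
         "GENENAME", "GID", "GO", "GOALL", "ONTOLOGY", "ONTOLOGYALL",
         "PMID", "REFSEQ", "SYMBOL")

# Columns that cannot be used as a keytype
not_keyable <- setdiff(cols, kts)
print(not_keyable)
```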

keys

Returns values (or keys) that can be expected for a given keytype. By default it will return the primary keys for the database.

head(keys(org.At.tair.db), 10)  # Primary keys
##  [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
##  [8] "AT1G01073" "AT1G01080" "AT1G01090"
head(keys(org.At.tair.db, keytype = "SYMBOL"), 10)
##  [1] "ANAC001" "NAC001"  "NTL10"   "ARV1"    "NGA3"    "ASU1"    "ATDCL1"  "CAF"    
##  [9] "DCL1"    "EMB60"
head(keys(org.At.tair.db, keytype = "GO"), 10)
##  [1] "GO:0003700" "GO:0005634" "GO:0006355" "GO:0003674" "GO:0005739" "GO:0005783"
##  [7] "GO:0005794" "GO:0006665" "GO:0009507" "GO:0016125"
head(keys(org.Fc.eg.db), 10)    # Primary keys
##  [1] "414734" "445455" "448843" "492297" "492308" "493648" "493649" "493650" "493651"
## [10] "493652"
head(keys(org.Fc.eg.db, keytype = "SYMBOL"), 10)
##  [1] "A1BG"    "A1CF"    "A2M"     "A2ML1"   "A3GALT2" "A4GALT"  "A4GNT"   "AAAS"   
##  [9] "AACS"    "AADAC"
head(keys(org.Fc.eg.db, keytype = "GO"), 10)
##  [1] "GO:0000002" "GO:0000003" "GO:0000010" "GO:0000012" "GO:0000014" "GO:0000015"
##  [7] "GO:0000026" "GO:0000027" "GO:0000028" "GO:0000030"

select

Retrieves data as a data.frame, based on the keys, columns and keytype arguments.

Ex1: Given TAIR ID, retrieves SYMBOL

myKeys <- head(keys(org.At.tair.db, keytype = "TAIR"), 10)
myKeys
##  [1] "AT1G01010" "AT1G01020" "AT1G01030" "AT1G01040" "AT1G01050" "AT1G01060" "AT1G01070"
##  [8] "AT1G01073" "AT1G01080" "AT1G01090"
select(org.At.tair.db, keys = myKeys, columns = "SYMBOL", keytype = "TAIR")
## 'select()' returned 1:many mapping between keys and columns
##         TAIR   SYMBOL
## 1  AT1G01010  ANAC001
## 2  AT1G01010   NAC001
## 3  AT1G01010    NTL10
## 4  AT1G01020     ARV1
## 5  AT1G01030     NGA3
## 6  AT1G01040     ASU1
## 7  AT1G01040   ATDCL1
## 8  AT1G01040      CAF
## 9  AT1G01040     DCL1
## 10 AT1G01040    EMB60
## 11 AT1G01040    EMB76
## 12 AT1G01040     SIN1
## 13 AT1G01040     SUS1
## 14 AT1G01050   AtPPa1
## 15 AT1G01050     PPa1
## 16 AT1G01060      LHY
## 17 AT1G01060     LHY1
## 18 AT1G01070 UMAMIT28
## 19 AT1G01073     <NA>
## 20 AT1G01080     <NA>
## 21 AT1G01090   PDH-E1

Ex2: Given SYMBOL, retrieves ENTREZID

myKeys <- c("CCA1", "LHY", "PRR7", "PRR9") # morning loop components
select(org.At.tair.db, keys = myKeys, columns = "ENTREZID", keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL ENTREZID
## 1   CCA1   819296
## 2    LHY   839341
## 3   PRR7   831793
## 4   PRR9   819292
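
Hand-typed keys like these are easy to misspell. One cautious approach is to compare them against keys() for the chosen keytype before calling select(). The sketch below uses a hypothetical helper, check_keys, and, so that it runs without the package, a handful of valid symbols copied from the keys() output earlier in this document; NOTAGENE is a made-up invalid key.

```r
# Hypothetical helper: separate candidate keys into valid and invalid ones.
check_keys <- function(candidates, valid) {
  ok <- candidates %in% valid
  list(valid = candidates[ok], invalid = candidates[!ok])
}

# A few symbols copied from keys(org.At.tair.db, keytype = "SYMBOL")
valid_symbols <- c("ANAC001", "NAC001", "NTL10", "ARV1", "NGA3")

checked <- check_keys(c("ARV1", "NGA3", "NOTAGENE"), valid_symbols)
print(checked)
```

With the package loaded, the valid argument would be keys(org.At.tair.db, keytype = "SYMBOL"), and only checked$valid would be passed on to select().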

Ex3: Given ENSEMBL ID, retrieves SYMBOL

myKeys <- head(keys(org.Fc.eg.db, keytype = "ENSEMBL"), 10)
myKeys
##  [1] "ENSFCAG00000000001" "ENSFCAG00000000007" "ENSFCAG00000000015" "ENSFCAG00000000022"
##  [5] "ENSFCAG00000000023" "ENSFCAG00000000024" "ENSFCAG00000000028" "ENSFCAG00000000029"
##  [9] "ENSFCAG00000000030" "ENSFCAG00000000031"
select(org.Fc.eg.db, keys = myKeys, columns = "SYMBOL", keytype = "ENSEMBL")
## 'select()' returned 1:1 mapping between keys and columns
##               ENSEMBL     SYMBOL
## 1  ENSFCAG00000000001     INTS6L
## 2  ENSFCAG00000000007      HMGCR
## 3  ENSFCAG00000000015     CEP192
## 4  ENSFCAG00000000022    RASGRP1
## 5  ENSFCAG00000000023      GPR39
## 6  ENSFCAG00000000024      LYPD1
## 7  ENSFCAG00000000028       RCN3
## 8  ENSFCAG00000000029       APOO
## 9  ENSFCAG00000000030  CXHXorf58
## 10 ENSFCAG00000000031 CB1H4orf19

Ex4: Given SYMBOL, retrieves ENSEMBL and ENTREZID

myKeys <- c("ASIP", "MC1R") # coat color patterns
select(org.Fc.eg.db, keys = myKeys, columns = c("ENSEMBL", "ENTREZID"), keytype = "SYMBOL")
## 'select()' returned 1:1 mapping between keys and columns
##   SYMBOL            ENSEMBL ENTREZID
## 1   ASIP ENSFCAG00000011037   492297
## 2   MC1R ENSFCAG00000003798   493917

Session information

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3/envs/r4/lib/libopenblasp-r0.3.12.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods  
## [9] base     
## 
## other attached packages:
##  [1] AnnotationHub_2.22.0  BiocFileCache_1.14.0  dbplyr_2.1.0         
##  [4] org.At.tair.db_3.12.0 AnnotationDbi_1.52.0  IRanges_2.24.1       
##  [7] S4Vectors_0.28.1      Biobase_2.50.0        BiocGenerics_0.36.0  
## [10] knitr_1.31           
## 
## loaded via a namespace (and not attached):
##  [1] tidyselect_1.1.0              xfun_0.21                    
##  [3] bslib_0.2.4                   BiocVersion_3.12.0           
##  [5] purrr_0.3.4                   vctrs_0.3.6                  
##  [7] generics_0.1.0                htmltools_0.5.1.1            
##  [9] yaml_2.2.1                    utf8_1.1.4                   
## [11] interactiveDisplayBase_1.28.0 blob_1.2.1                   
## [13] rlang_0.4.10                  later_1.1.0.1                
## [15] jquerylib_0.1.3               pillar_1.5.0                 
## [17] withr_2.4.1                   glue_1.4.2                   
## [19] DBI_1.1.1                     rappdirs_0.3.3               
## [21] bit64_4.0.5                   lifecycle_1.0.0              
## [23] stringr_1.4.0                 memoise_2.0.0                
## [25] evaluate_0.14                 fastmap_1.1.0                
## [27] httpuv_1.5.5                  curl_4.3                     
## [29] fansi_0.4.2                   Rcpp_1.0.6                   
## [31] xtable_1.8-4                  promises_1.2.0.1             
## [33] BiocManager_1.30.10           cachem_1.0.4                 
## [35] jsonlite_1.7.2                mime_0.10                    
## [37] bit_4.0.4                     digest_0.6.27                
## [39] stringi_1.5.3                 shiny_1.6.0                  
## [41] dplyr_1.0.4                   tools_4.0.3                  
## [43] magrittr_2.0.1                sass_0.3.1                   
## [45] RSQLite_2.2.3                 tibble_3.1.0                 
## [47] crayon_1.4.1                  pkgconfig_2.0.3              
## [49] ellipsis_0.3.1                assertthat_0.2.1             
## [51] rmarkdown_2.7                 httr_1.4.2                   
## [53] R6_2.5.0                      compiler_4.0.3