16S-rDNA-V3-V4 repository: https://github.com/ycl6/16S-rDNA-V3-V4
cutadapt: GitHub, Documentation, Paper
DADA2: GitHub, Documentation, Paper
Demo Dataset: PRJEB27564 from Gut Microbiota in Parkinson’s Disease: Temporal Stability and Relations to Disease Progression. EBioMedicine. 2019;44:691-707
License: GPL-3.0
The PCR primer sequences can be removed with any primer or adapter trimming software. cutadapt offers options to specify primer/adapter sequences at both the 5’ and 3’ ends, and on the second read of a pair, making it an ideal choice for this task.
The -g and -a parameters specify adapters to be removed from the 5’ and 3’ end of each read, respectively. The -G and -A options work like their -g and -a counterparts, but are applied to the second read in each pair. A generic command therefore looks like this:
cutadapt \
-g 5'-end-adapter-seq \
-a 3'-end-adapter-seq \
-G 5'-end-adapter-from-second-read-seq \
-A 3'-end-adapter-from-second-read-seq
For a 341F (5’-CCTACGGGNGGCWGCAG-3’) and 785R (5’-GACTACHVGGGTATCTAATCC-3’) primer set, the trimming parameters would be like this:
# -g: 341F; -a: reverse complement of 785R (applied to R1)
# -G: 785R; -A: reverse complement of 341F (applied to R2)
cutadapt \
-g CCTACGGGNGGCWGCAG \
-a GGATTAGATACCCBDGTAGTC \
-G GACTACHVGGGTATCTAATCC \
-A CTGCWGCCNCCCGTAGG
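The reverse complements above can be double-checked in base R. This is a minimal dependency-free sketch (chartr handles the IUPAC ambiguity codes); in the R workflow below, dada2::rc() does the same job:

```r
# Reverse-complement an IUPAC primer sequence using base R only
revcomp <- function(seq) {
  comp <- chartr("ACGTMRWSYKVHDBN", "TGCAKYWSRMBDHVN", seq)  # IUPAC complements
  paste(rev(strsplit(comp, "")[[1]]), collapse = "")          # reverse the string
}

revcomp("CCTACGGGNGGCWGCAG")      # 341F -> "CTGCWGCCNCCCGTAGG"
revcomp("GACTACHVGGGTATCTAATCC")  # 785R -> "GGATTAGATACCCBDGTAGTC"
```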
You can use the included Perl script run_trimming.pl to perform primer trimming. It determines the reverse complements of the primer sequences you provide and performs the trimming.
In the below example, the raw fastq files are placed in the raw folder and the trimmed fastq files will be placed in the pre-defined folder named trimmed.
cd /ngs/16S-Demo
# Usage: perl run_trimming.pl project_folder fastq_folder forward_primer_seq reverse_primer_seq
perl run_trimming.pl PRJEB27564 raw CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC
Or follow the instructions below to run cutadapt within R.
cd /ngs/16S-Demo/PRJEB27564
R
library("dada2")
library("data.table")
library("DECIPHER")
library("phangorn")
library("phyloseq")
library("ggplot2")
Change the fastq path to correspond to the location of the raw fastq files.
fastq = "raw" # raw fastq files
trimmed = "trimmed" # cutadapt trimmed fastq files
filt = "filt" # dada2 trimmed fastq files
outfiles = "outfiles" # output files
images = "images" # output images
if(!dir.exists(filt)) dir.create(filt)
if(!dir.exists(outfiles)) dir.create(outfiles)
if(!dir.exists(images)) dir.create(images)
head(list.files(fastq), 15)
## [1] "download.sh" "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz"
## [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730154_1.fastq.gz"
## [7] "ERR2730154_2.fastq.gz" "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz"
## [10] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730162_1.fastq.gz"
## [13] "ERR2730162_2.fastq.gz" "ERR2730163_1.fastq.gz" "ERR2730163_2.fastq.gz"
The trimming program cutadapt is called through the system2 function to perform the trimming.
Note: Skip this step if trimming has already been performed.
fns = sort(list.files(fastq, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)]
fnRs = fns[grep("2.fastq.gz", fns)]
if(!dir.exists(trimmed)) dir.create(trimmed)
fnFs.cut = file.path(trimmed, basename(fnFs))
fnRs.cut = file.path(trimmed, basename(fnRs))
log.cut = gsub(".1.fastq.gz", ".log", fnFs.cut)
sample.names = gsub(".1.fastq.gz", "", basename(fnFs.cut))
# Define the primer set used to perform PCR
FWD = "CCTACGGGNGGCWGCAG" # 341F
REV = "GACTACHVGGGTATCTAATCC" # 785R
# Get reverse complement DNA sequences
FWD.RC = dada2::rc(FWD)
REV.RC = dada2::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags = paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags = paste("-G", REV, "-A", FWD.RC)
# Run cutadapt to remove primers
# Note to change the PATH to cutadapt accordingly
cutadapt = "/path-to-bin/cutadapt"
for(i in seq_along(fnFs)) {
print(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
system2(cutadapt,
stdout = log.cut[i], stderr = log.cut[i], # log file
args = c(R1.flags, R2.flags,
"-n 2", # -n 2 required to remove FWD and REV from reads
"--match-read-wildcards", # enable IUPAC nucleotide codes (wildcard characters)
"--length 300", # Truncate reads to 300 bp
"-m 150", # discard reads shorter than LEN (avoid length zero sequences)
"--overlap 10", # min overlap between read and adapter for an adapter to be found
"-j 0", # auto-detection of CPU cores, only available on Python 3
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # trimmed files
fnFs[i], fnRs[i]) # input files
)
}
head(list.files(trimmed), 15)
## [1] "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz" "ERR2730149.log"
## [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730151.log"
## [7] "ERR2730154_1.fastq.gz" "ERR2730154_2.fastq.gz" "ERR2730154.log"
## [10] "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz" "ERR2730155.log"
## [13] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730158.log"
fns = sort(list.files(trimmed, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)] # Update the grep pattern when necessary
fnRs = fns[grep("2.fastq.gz", fns)] # Update the grep pattern when necessary
sample.names = gsub(".1.fastq.gz", "", basename(fnFs)) # Update the gsub pattern when necessary
# Check objects
fnFs
## [1] "trimmed/ERR2730149_1.fastq.gz" "trimmed/ERR2730151_1.fastq.gz"
## [3] "trimmed/ERR2730154_1.fastq.gz" "trimmed/ERR2730155_1.fastq.gz"
## [5] "trimmed/ERR2730158_1.fastq.gz" "trimmed/ERR2730162_1.fastq.gz"
## [7] "trimmed/ERR2730163_1.fastq.gz" "trimmed/ERR2730165_1.fastq.gz"
## [9] "trimmed/ERR2730166_1.fastq.gz" "trimmed/ERR2730170_1.fastq.gz"
## [11] "trimmed/ERR2730171_1.fastq.gz" "trimmed/ERR2730174_1.fastq.gz"
## [13] "trimmed/ERR2730178_1.fastq.gz" "trimmed/ERR2730179_1.fastq.gz"
## [15] "trimmed/ERR2730180_1.fastq.gz" "trimmed/ERR2730181_1.fastq.gz"
## [17] "trimmed/ERR2730184_1.fastq.gz" "trimmed/ERR2730187_1.fastq.gz"
## [19] "trimmed/ERR2730189_1.fastq.gz" "trimmed/ERR2730190_1.fastq.gz"
## [21] "trimmed/ERR2730193_1.fastq.gz" "trimmed/ERR2730194_1.fastq.gz"
## [23] "trimmed/ERR2730196_1.fastq.gz" "trimmed/ERR2730197_1.fastq.gz"
## [25] "trimmed/ERR2730198_1.fastq.gz" "trimmed/ERR2730199_1.fastq.gz"
## [27] "trimmed/ERR2730203_1.fastq.gz" "trimmed/ERR2730204_1.fastq.gz"
## [29] "trimmed/ERR2730206_1.fastq.gz" "trimmed/ERR2730209_1.fastq.gz"
## [31] "trimmed/ERR2730211_1.fastq.gz" "trimmed/ERR2730213_1.fastq.gz"
## [33] "trimmed/ERR2730214_1.fastq.gz" "trimmed/ERR2730216_1.fastq.gz"
## [35] "trimmed/ERR2730217_1.fastq.gz" "trimmed/ERR2730219_1.fastq.gz"
## [37] "trimmed/ERR2730223_1.fastq.gz" "trimmed/ERR2730225_1.fastq.gz"
## [39] "trimmed/ERR2730232_1.fastq.gz" "trimmed/ERR2730233_1.fastq.gz"
## [41] "trimmed/ERR2730234_1.fastq.gz" "trimmed/ERR2730235_1.fastq.gz"
## [43] "trimmed/ERR2730236_1.fastq.gz" "trimmed/ERR2730237_1.fastq.gz"
## [45] "trimmed/ERR2730243_1.fastq.gz" "trimmed/ERR2730246_1.fastq.gz"
## [47] "trimmed/ERR2730247_1.fastq.gz" "trimmed/ERR2730248_1.fastq.gz"
## [49] "trimmed/ERR2730249_1.fastq.gz" "trimmed/ERR2730253_1.fastq.gz"
## [51] "trimmed/ERR2730256_1.fastq.gz" "trimmed/ERR2730258_1.fastq.gz"
## [53] "trimmed/ERR2730259_1.fastq.gz" "trimmed/ERR2730261_1.fastq.gz"
## [55] "trimmed/ERR2730265_1.fastq.gz" "trimmed/ERR2730267_1.fastq.gz"
## [57] "trimmed/ERR2730269_1.fastq.gz" "trimmed/ERR2730270_1.fastq.gz"
## [59] "trimmed/ERR2730272_1.fastq.gz" "trimmed/ERR2730274_1.fastq.gz"
## [61] "trimmed/ERR2730275_1.fastq.gz" "trimmed/ERR2730277_1.fastq.gz"
## [63] "trimmed/ERR2730279_1.fastq.gz" "trimmed/ERR2730282_1.fastq.gz"
## [65] "trimmed/ERR2730283_1.fastq.gz" "trimmed/ERR2730286_1.fastq.gz"
## [67] "trimmed/ERR2730290_1.fastq.gz" "trimmed/ERR2730291_1.fastq.gz"
## [69] "trimmed/ERR2730293_1.fastq.gz" "trimmed/ERR2730294_1.fastq.gz"
## [71] "trimmed/ERR2730298_1.fastq.gz" "trimmed/ERR2730299_1.fastq.gz"
## [73] "trimmed/ERR2730302_1.fastq.gz" "trimmed/ERR2730306_1.fastq.gz"
## [75] "trimmed/ERR2730307_1.fastq.gz" "trimmed/ERR2730308_1.fastq.gz"
## [77] "trimmed/ERR2730309_1.fastq.gz" "trimmed/ERR2730312_1.fastq.gz"
## [79] "trimmed/ERR2730315_1.fastq.gz" "trimmed/ERR2730317_1.fastq.gz"
## [81] "trimmed/ERR2730318_1.fastq.gz" "trimmed/ERR2730321_1.fastq.gz"
## [83] "trimmed/ERR2730322_1.fastq.gz" "trimmed/ERR2730324_1.fastq.gz"
## [85] "trimmed/ERR2730325_1.fastq.gz" "trimmed/ERR2730326_1.fastq.gz"
## [87] "trimmed/ERR2730327_1.fastq.gz" "trimmed/ERR2730331_1.fastq.gz"
## [89] "trimmed/ERR2730332_1.fastq.gz" "trimmed/ERR2730334_1.fastq.gz"
## [91] "trimmed/ERR2730337_1.fastq.gz" "trimmed/ERR2730339_1.fastq.gz"
## [93] "trimmed/ERR2730341_1.fastq.gz" "trimmed/ERR2730342_1.fastq.gz"
## [95] "trimmed/ERR2730344_1.fastq.gz" "trimmed/ERR2730345_1.fastq.gz"
## [97] "trimmed/ERR2730347_1.fastq.gz" "trimmed/ERR2730351_1.fastq.gz"
## [99] "trimmed/ERR2730353_1.fastq.gz" "trimmed/ERR2730360_1.fastq.gz"
## [101] "trimmed/ERR2730361_1.fastq.gz" "trimmed/ERR2730362_1.fastq.gz"
## [103] "trimmed/ERR2730363_1.fastq.gz" "trimmed/ERR2730364_1.fastq.gz"
## [105] "trimmed/ERR2730365_1.fastq.gz" "trimmed/ERR2730371_1.fastq.gz"
## [107] "trimmed/ERR2730374_1.fastq.gz" "trimmed/ERR2730375_1.fastq.gz"
## [109] "trimmed/ERR2730376_1.fastq.gz" "trimmed/ERR2730377_1.fastq.gz"
## [111] "trimmed/ERR2730381_1.fastq.gz" "trimmed/ERR2730384_1.fastq.gz"
## [113] "trimmed/ERR2730386_1.fastq.gz" "trimmed/ERR2730387_1.fastq.gz"
## [115] "trimmed/ERR2730389_1.fastq.gz" "trimmed/ERR2730393_1.fastq.gz"
## [117] "trimmed/ERR2730395_1.fastq.gz" "trimmed/ERR2730397_1.fastq.gz"
## [119] "trimmed/ERR2730398_1.fastq.gz" "trimmed/ERR2730400_1.fastq.gz"
## [121] "trimmed/ERR2730402_1.fastq.gz" "trimmed/ERR2730403_1.fastq.gz"
## [123] "trimmed/ERR3040013_1.fastq.gz" "trimmed/ERR3040014_1.fastq.gz"
## [125] "trimmed/ERR3040017_1.fastq.gz" "trimmed/ERR3040018_1.fastq.gz"
fnRs
## [1] "trimmed/ERR2730149_2.fastq.gz" "trimmed/ERR2730151_2.fastq.gz"
## [3] "trimmed/ERR2730154_2.fastq.gz" "trimmed/ERR2730155_2.fastq.gz"
## [5] "trimmed/ERR2730158_2.fastq.gz" "trimmed/ERR2730162_2.fastq.gz"
## [7] "trimmed/ERR2730163_2.fastq.gz" "trimmed/ERR2730165_2.fastq.gz"
## [9] "trimmed/ERR2730166_2.fastq.gz" "trimmed/ERR2730170_2.fastq.gz"
## [11] "trimmed/ERR2730171_2.fastq.gz" "trimmed/ERR2730174_2.fastq.gz"
## [13] "trimmed/ERR2730178_2.fastq.gz" "trimmed/ERR2730179_2.fastq.gz"
## [15] "trimmed/ERR2730180_2.fastq.gz" "trimmed/ERR2730181_2.fastq.gz"
## [17] "trimmed/ERR2730184_2.fastq.gz" "trimmed/ERR2730187_2.fastq.gz"
## [19] "trimmed/ERR2730189_2.fastq.gz" "trimmed/ERR2730190_2.fastq.gz"
## [21] "trimmed/ERR2730193_2.fastq.gz" "trimmed/ERR2730194_2.fastq.gz"
## [23] "trimmed/ERR2730196_2.fastq.gz" "trimmed/ERR2730197_2.fastq.gz"
## [25] "trimmed/ERR2730198_2.fastq.gz" "trimmed/ERR2730199_2.fastq.gz"
## [27] "trimmed/ERR2730203_2.fastq.gz" "trimmed/ERR2730204_2.fastq.gz"
## [29] "trimmed/ERR2730206_2.fastq.gz" "trimmed/ERR2730209_2.fastq.gz"
## [31] "trimmed/ERR2730211_2.fastq.gz" "trimmed/ERR2730213_2.fastq.gz"
## [33] "trimmed/ERR2730214_2.fastq.gz" "trimmed/ERR2730216_2.fastq.gz"
## [35] "trimmed/ERR2730217_2.fastq.gz" "trimmed/ERR2730219_2.fastq.gz"
## [37] "trimmed/ERR2730223_2.fastq.gz" "trimmed/ERR2730225_2.fastq.gz"
## [39] "trimmed/ERR2730232_2.fastq.gz" "trimmed/ERR2730233_2.fastq.gz"
## [41] "trimmed/ERR2730234_2.fastq.gz" "trimmed/ERR2730235_2.fastq.gz"
## [43] "trimmed/ERR2730236_2.fastq.gz" "trimmed/ERR2730237_2.fastq.gz"
## [45] "trimmed/ERR2730243_2.fastq.gz" "trimmed/ERR2730246_2.fastq.gz"
## [47] "trimmed/ERR2730247_2.fastq.gz" "trimmed/ERR2730248_2.fastq.gz"
## [49] "trimmed/ERR2730249_2.fastq.gz" "trimmed/ERR2730253_2.fastq.gz"
## [51] "trimmed/ERR2730256_2.fastq.gz" "trimmed/ERR2730258_2.fastq.gz"
## [53] "trimmed/ERR2730259_2.fastq.gz" "trimmed/ERR2730261_2.fastq.gz"
## [55] "trimmed/ERR2730265_2.fastq.gz" "trimmed/ERR2730267_2.fastq.gz"
## [57] "trimmed/ERR2730269_2.fastq.gz" "trimmed/ERR2730270_2.fastq.gz"
## [59] "trimmed/ERR2730272_2.fastq.gz" "trimmed/ERR2730274_2.fastq.gz"
## [61] "trimmed/ERR2730275_2.fastq.gz" "trimmed/ERR2730277_2.fastq.gz"
## [63] "trimmed/ERR2730279_2.fastq.gz" "trimmed/ERR2730282_2.fastq.gz"
## [65] "trimmed/ERR2730283_2.fastq.gz" "trimmed/ERR2730286_2.fastq.gz"
## [67] "trimmed/ERR2730290_2.fastq.gz" "trimmed/ERR2730291_2.fastq.gz"
## [69] "trimmed/ERR2730293_2.fastq.gz" "trimmed/ERR2730294_2.fastq.gz"
## [71] "trimmed/ERR2730298_2.fastq.gz" "trimmed/ERR2730299_2.fastq.gz"
## [73] "trimmed/ERR2730302_2.fastq.gz" "trimmed/ERR2730306_2.fastq.gz"
## [75] "trimmed/ERR2730307_2.fastq.gz" "trimmed/ERR2730308_2.fastq.gz"
## [77] "trimmed/ERR2730309_2.fastq.gz" "trimmed/ERR2730312_2.fastq.gz"
## [79] "trimmed/ERR2730315_2.fastq.gz" "trimmed/ERR2730317_2.fastq.gz"
## [81] "trimmed/ERR2730318_2.fastq.gz" "trimmed/ERR2730321_2.fastq.gz"
## [83] "trimmed/ERR2730322_2.fastq.gz" "trimmed/ERR2730324_2.fastq.gz"
## [85] "trimmed/ERR2730325_2.fastq.gz" "trimmed/ERR2730326_2.fastq.gz"
## [87] "trimmed/ERR2730327_2.fastq.gz" "trimmed/ERR2730331_2.fastq.gz"
## [89] "trimmed/ERR2730332_2.fastq.gz" "trimmed/ERR2730334_2.fastq.gz"
## [91] "trimmed/ERR2730337_2.fastq.gz" "trimmed/ERR2730339_2.fastq.gz"
## [93] "trimmed/ERR2730341_2.fastq.gz" "trimmed/ERR2730342_2.fastq.gz"
## [95] "trimmed/ERR2730344_2.fastq.gz" "trimmed/ERR2730345_2.fastq.gz"
## [97] "trimmed/ERR2730347_2.fastq.gz" "trimmed/ERR2730351_2.fastq.gz"
## [99] "trimmed/ERR2730353_2.fastq.gz" "trimmed/ERR2730360_2.fastq.gz"
## [101] "trimmed/ERR2730361_2.fastq.gz" "trimmed/ERR2730362_2.fastq.gz"
## [103] "trimmed/ERR2730363_2.fastq.gz" "trimmed/ERR2730364_2.fastq.gz"
## [105] "trimmed/ERR2730365_2.fastq.gz" "trimmed/ERR2730371_2.fastq.gz"
## [107] "trimmed/ERR2730374_2.fastq.gz" "trimmed/ERR2730375_2.fastq.gz"
## [109] "trimmed/ERR2730376_2.fastq.gz" "trimmed/ERR2730377_2.fastq.gz"
## [111] "trimmed/ERR2730381_2.fastq.gz" "trimmed/ERR2730384_2.fastq.gz"
## [113] "trimmed/ERR2730386_2.fastq.gz" "trimmed/ERR2730387_2.fastq.gz"
## [115] "trimmed/ERR2730389_2.fastq.gz" "trimmed/ERR2730393_2.fastq.gz"
## [117] "trimmed/ERR2730395_2.fastq.gz" "trimmed/ERR2730397_2.fastq.gz"
## [119] "trimmed/ERR2730398_2.fastq.gz" "trimmed/ERR2730400_2.fastq.gz"
## [121] "trimmed/ERR2730402_2.fastq.gz" "trimmed/ERR2730403_2.fastq.gz"
## [123] "trimmed/ERR3040013_2.fastq.gz" "trimmed/ERR3040014_2.fastq.gz"
## [125] "trimmed/ERR3040017_2.fastq.gz" "trimmed/ERR3040018_2.fastq.gz"
sample.names
## [1] "ERR2730149" "ERR2730151" "ERR2730154" "ERR2730155" "ERR2730158" "ERR2730162"
## [7] "ERR2730163" "ERR2730165" "ERR2730166" "ERR2730170" "ERR2730171" "ERR2730174"
## [13] "ERR2730178" "ERR2730179" "ERR2730180" "ERR2730181" "ERR2730184" "ERR2730187"
## [19] "ERR2730189" "ERR2730190" "ERR2730193" "ERR2730194" "ERR2730196" "ERR2730197"
## [25] "ERR2730198" "ERR2730199" "ERR2730203" "ERR2730204" "ERR2730206" "ERR2730209"
## [31] "ERR2730211" "ERR2730213" "ERR2730214" "ERR2730216" "ERR2730217" "ERR2730219"
## [37] "ERR2730223" "ERR2730225" "ERR2730232" "ERR2730233" "ERR2730234" "ERR2730235"
## [43] "ERR2730236" "ERR2730237" "ERR2730243" "ERR2730246" "ERR2730247" "ERR2730248"
## [49] "ERR2730249" "ERR2730253" "ERR2730256" "ERR2730258" "ERR2730259" "ERR2730261"
## [55] "ERR2730265" "ERR2730267" "ERR2730269" "ERR2730270" "ERR2730272" "ERR2730274"
## [61] "ERR2730275" "ERR2730277" "ERR2730279" "ERR2730282" "ERR2730283" "ERR2730286"
## [67] "ERR2730290" "ERR2730291" "ERR2730293" "ERR2730294" "ERR2730298" "ERR2730299"
## [73] "ERR2730302" "ERR2730306" "ERR2730307" "ERR2730308" "ERR2730309" "ERR2730312"
## [79] "ERR2730315" "ERR2730317" "ERR2730318" "ERR2730321" "ERR2730322" "ERR2730324"
## [85] "ERR2730325" "ERR2730326" "ERR2730327" "ERR2730331" "ERR2730332" "ERR2730334"
## [91] "ERR2730337" "ERR2730339" "ERR2730341" "ERR2730342" "ERR2730344" "ERR2730345"
## [97] "ERR2730347" "ERR2730351" "ERR2730353" "ERR2730360" "ERR2730361" "ERR2730362"
## [103] "ERR2730363" "ERR2730364" "ERR2730365" "ERR2730371" "ERR2730374" "ERR2730375"
## [109] "ERR2730376" "ERR2730377" "ERR2730381" "ERR2730384" "ERR2730386" "ERR2730387"
## [115] "ERR2730389" "ERR2730393" "ERR2730395" "ERR2730397" "ERR2730398" "ERR2730400"
## [121] "ERR2730402" "ERR2730403" "ERR3040013" "ERR3040014" "ERR3040017" "ERR3040018"
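Note that the grep and gsub patterns above use "." as a regex wildcard, which happens to match the underscore in names like ERR2730149_1.fastq.gz. A quick sanity check; the anchored alternative at the end is an assumption for files named differently:

```r
fn <- "ERR2730149_1.fastq.gz"
grepl("1.fastq.gz", fn)            # TRUE -> picked up as a forward read
gsub(".1.fastq.gz", "", fn)        # "ERR2730149" (the "." matches the "_")
# Stricter, anchored alternative if your file names differ:
sub("_[12]\\.fastq\\.gz$", "", fn) # "ERR2730149"
```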
We use the plotQualityProfile function to plot the quality profiles of the trimmed fastq files.
# Plot quality profile of fastq files
ii = 1:length(sample.names)
pdf(paste0(images, "/plotQualityProfile.pdf"), width = 8, height = 8, pointsize = 12)
for(i in ii) {
message(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
print(plotQualityProfile(fnFs[i]) + ggtitle("Fwd"))
print(plotQualityProfile(fnRs[i]) + ggtitle("Rev"))
}
invisible(dev.off())
Reviewing the quality profiles, the quality distribution of the forward reads appears to drop after position 260, and that of the reverse reads after position 200. We will use these positions to truncate the forward and reverse reads respectively in the next step.
Remember to review plotQualityProfile.pdf to select the best parameters for the truncLen argument.
# Set paths to the dada2-filterd files
filtFs = file.path(filt, basename(fnFs))
filtRs = file.path(filt, basename(fnRs))
# Perform filtering and trimming
out = filterAndTrim(fnFs, filtFs, fnRs, filtRs,
# Need to keep parameters consistent between runs of the same study
truncLen = c(260,200), minLen = 200, maxN = 0, truncQ = 2, maxEE = c(2,5),
rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)
out = as.data.frame(out)
rownames(out) = sample.names
head(out, 10)
## reads.in reads.out
## ERR2730149 65747 58829
## ERR2730151 70743 61547
## ERR2730154 91558 81374
## ERR2730155 58955 53269
## ERR2730158 128246 116847
## ERR2730162 134565 121999
## ERR2730163 116621 105093
## ERR2730165 57401 51513
## ERR2730166 121037 108838
## ERR2730170 57359 51276
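It can be worth checking the fraction of reads retained per sample before moving on; samples losing far more reads than their peers may warrant a second look. A minimal sketch with toy counts standing in for the real out object from filterAndTrim:

```r
# Toy filterAndTrim summary; in the workflow, use the real `out` object
out.demo <- data.frame(reads.in  = c(65747, 70743),
                       reads.out = c(58829, 61547),
                       row.names = c("ERR2730149", "ERR2730151"))
# Percentage of reads surviving filtering, per sample
out.demo$pct.kept <- round(100 * out.demo$reads.out / out.demo$reads.in, 1)
out.demo$pct.kept  # 89.5 87.0
```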
Use the learnErrors function to perform dereplication and learn the error rates. The derepFastq function used in past versions of the workflow has been integrated into the learnErrors function.
errF = learnErrors(filtFs, multithread = TRUE)
## 128404900 total bases in 493865 reads from 6 samples will be used for learning the error rates.
errR = learnErrors(filtRs, multithread = TRUE)
## 119791600 total bases in 598958 reads from 7 samples will be used for learning the error rates.
# Visualize the estimated error rates
pdf(paste0(images, "/plotErrors.pdf"), width = 10, height = 10, pointsize = 12)
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
invisible(dev.off())
Forward reads
Reverse reads
We now apply the core sample inference algorithm to the filtered and trimmed sequence data.
Note: By default, the dada function processes each sample independently (i.e. pool = FALSE). If your samples come from an extremely diverse community (e.g. soil), pooling (pool = TRUE) or pseudo-pooling (recommended; pool = "pseudo") might help to identify rare ASVs in each sample.
dadaFs = dada(filtFs, err = errF, pool = FALSE, multithread = TRUE)
dadaRs = dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
We now merge the forward and reverse reads together to obtain the full denoised sequences.
mergers = mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
We now construct an amplicon sequence variant (ASV) table.
seqtab = makeSequenceTable(mergers)
View the sequence length frequency distribution with table.
table(nchar(getSequences(seqtab)))
##
## 260 261 264 267 270 272 273 275 277 278 279 281 282 284 286
## 461 7 1 1 7 3 7 2 10 7 11 14 1 1 1
## 288 300 301 313 317 320 325 335 341 365 384 385 386 400 401
## 1 2 1 1 2 1 2 1 1 1 4 2 3 2 724
## 402 403 404 405 406 407 408 409 410 411 412 413 417 418 419
## 11302 2584 4587 4221 126 257 112 124 34 264 13 1 2 8 101
## 420 421 422 423 424 425 426 427 428 429 430 431 432 433 438
## 67 531 3606 326 82 68 206 1082 452 7 12 1 8 3 1
## 439 440 441 442 443 444 445 446 447 448
## 24 7 13 10 9 10 4 6 5 7
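Most ASVs above fall in the roughly 400–431 bp range expected for a V3–V4 amplicon. If you wish to drop off-target lengths before chimera removal (an optional step, not performed in this workflow), the sequence table can be subset by the length of its column names. A toy sketch:

```r
# Toy 1-sample sequence table: column names are the ASV sequences themselves
demo <- matrix(c(5L, 80L, 3L), nrow = 1,
               dimnames = list("S1", c(strrep("A", 260),
                                       strrep("A", 420),
                                       strrep("A", 450))))
keep <- nchar(colnames(demo)) %in% 400:431
demo.filt <- demo[, keep, drop = FALSE]
ncol(demo.filt)  # 1 (only the 420-bp ASV survives)
# On the real object: seqtab[, nchar(getSequences(seqtab)) %in% 400:431]
```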
Save sequence table.
saveRDS(seqtab, "seqtab.rds")
# Or as an example, save only the first 5 samples
# saveRDS(seqtab[c(1:5),], "seqtab.rds")
This section elaborates a little on merging results from multiple sequencing runs. You can merge sequence tables from multiple runs belonging to the same experiment or project before moving on to chimera removal and taxonomy assignment. In the example snippet below, we use mergeSequenceTables to combine two sequence tables.
Note: Skip this step if you do not require merging multiple sequence tables
# Load sequence tables from multiple runs
seqtab1 = readRDS("/some-path/run1/seqtab.rds")
seqtab2 = readRDS("/some-path/run2/seqtab.rds")
# Check the dimensions and sample names
dim(seqtab1)
rownames(seqtab1)
dim(seqtab2)
rownames(seqtab2)
# Merge sequence table and remove chimeras
seqtab = mergeSequenceTables(seqtab1, seqtab2)
# To sum the counts of the same sample across multiple sequence tables (i.e. when a sample was re-sequenced due to low depth)
# seqtab = mergeSequenceTables(seqtab1, seqtab2, repeats = "sum")
# Check the dimension of the merged sequence table
dim(seqtab)
After obtaining the merged sequence table, you can continue with the data processing steps below.
seqtab.nochim = removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE,
verbose = TRUE)
## Identified 23851 bimeras out of 31565 input sequences.
# Update the sample names to exclude "fastq.gz" in the name
rownames(seqtab.nochim) = sample.names
View the dimensions of seqtab and seqtab.nochim.
dim(seqtab)
## [1] 126 31565
dim(seqtab.nochim)
## [1] 126 7714
Print the proportion of non-chimeras in merged sequence reads.
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9683915
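The overall proportion hides per-sample variation; the same ratio can be computed per sample with rowSums. A toy sketch, with small matrices standing in for seqtab and seqtab.nochim:

```r
# Two toy samples x two ASVs; rows are samples, as in a DADA2 sequence table
seqtab.demo <- matrix(c(100L, 200L, 50L, 0L), nrow = 2)
nochim.demo <- matrix(c(100L, 180L,  0L, 0L), nrow = 2)
# Per-sample fraction of reads surviving chimera removal
frac <- rowSums(nochim.demo) / rowSums(seqtab.demo)
round(frac, 2)  # 0.67 0.90
# On the real objects: rowSums(seqtab.nochim) / rowSums(seqtab)
```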
getN <- function(x) sum(getUniques(x))
track = cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) = c("Trimmed", "Filtered", "denoisedF", "denoisedR", "merged", "nonchim")
track = cbind(data.frame(SampleID = sample.names), track)
write.table(track, file = paste0(outfiles, "/track.txt"), sep = "\t", quote = F,
row.names = F, col.names = T)
Set the full paths to the SILVA and NCBI databases.
dbpath = "/path-to-db/"
ref1 = paste0(dbpath, "silva_nr99_v138.1_train_set.fa.gz")
ref2 = paste0(dbpath, "silva_species_assignment_v138.1.fa.gz")
ref3 = paste0(dbpath, "16SMicrobial.fa.gz")
Use the assignTaxonomy function to classify sequences against the SILVA reference training dataset (ref1), and use the assignSpecies function to perform taxonomic assignment to the species level by exact matching against the SILVA (ref2) and NCBI (ref3) reference datasets.
# Extracts the sequences
seqs = getSequences(seqtab.nochim)
set.seed(12345)
taxtab = assignTaxonomy(seqs, refFasta = ref1, minBoot = 80, tryRC = TRUE, verbose = TRUE,
outputBootstraps = TRUE, multithread = TRUE)
## Finished processing reference fasta.
spec_silva = assignSpecies(seqs, ref2, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 473 out of 7714 were assigned to the species level.
spec_ncbi = assignSpecies(seqs, ref3, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 355 out of 7714 were assigned to the species level.
Combine the species-level taxonomic assignments from the two reference sources.
SVformat = paste("%0",nchar(as.character(ncol(seqtab.nochim))),"d", sep = "")
svid = paste0("SV", sprintf(SVformat, seq(ncol(seqtab.nochim))))
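The two lines above build zero-padded ASV identifiers whose width follows the total number of ASVs, so the labels sort correctly as plain strings. A small self-contained check (n = 7714 matches ncol(seqtab.nochim) in this run):

```r
# Zero-padded SV identifiers: the pad width follows the ASV count
n <- 7714
SVformat <- paste0("%0", nchar(as.character(n)), "d")  # "%04d"
svid <- paste0("SV", sprintf(SVformat, seq(n)))
head(svid, 3)  # "SV0001" "SV0002" "SV0003"
tail(svid, 1)  # "SV7714"
```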
s_silva = as.data.frame(spec_silva, stringsAsFactors = FALSE)
rownames(s_silva) = svid
s_ncbi = as.data.frame(spec_ncbi, stringsAsFactors = FALSE)
rownames(s_ncbi) = svid
s_ncbi$Genus = gsub("\\[|\\]", "", s_ncbi$Genus)
s_merged = cbind(s_ncbi, s_silva)
colnames(s_merged) = c("nGenus","nSpecies","sGenus","sSpecies")
s_merged1 = s_merged[!is.na(s_merged$nSpecies),]
colnames(s_merged1)[1:2] = c("Genus","Species")
s_merged2 = s_merged[is.na(s_merged$nSpecies) & !is.na(s_merged$sSpecies),]
colnames(s_merged2)[3:4] = c("Genus","Species")
s_merged3 = s_merged[is.na(s_merged$nSpecies) & is.na(s_merged$sSpecies),]
colnames(s_merged3)[3:4] = c("Genus","Species")
s_final = rbind(s_merged1[,c("Genus","Species")], s_merged2[,c("Genus","Species")],
s_merged3[,c("Genus","Species")])
s_final = s_final[order(row.names(s_final)),]
s_final = as.matrix(s_final)
if("Genus" %in% colnames(taxtab$tax)) {
gcol = which(colnames(taxtab$tax) == "Genus")
} else {
gcol = ncol(taxtab$tax)
}
matchGenera <- function(gen.tax, gen.binom, split.glyph = "/") {
if(is.na(gen.tax) || is.na(gen.binom)) { return(FALSE) }
if((gen.tax == gen.binom) ||
grepl(paste0("^", gen.binom, "[ _", split.glyph, "]"), gen.tax) ||
grepl(paste0(split.glyph, gen.binom, "$"), gen.tax)) {
return(TRUE)
} else {
return(FALSE)
}
}
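To illustrate the matching rule, here are a few example calls to matchGenera (its definition is repeated so the snippet runs standalone): a species assignment is kept only when its genus agrees with the assignTaxonomy genus, allowing for "/"-joined and "_"-suffixed genus names.

```r
# Genus-matching rule, as defined in the workflow above
matchGenera <- function(gen.tax, gen.binom, split.glyph = "/") {
  if(is.na(gen.tax) || is.na(gen.binom)) return(FALSE)
  (gen.tax == gen.binom) ||
    grepl(paste0("^", gen.binom, "[ _", split.glyph, "]"), gen.tax) ||
    grepl(paste0(split.glyph, gen.binom, "$"), gen.tax)
}

matchGenera("Escherichia/Shigella", "Escherichia")       # TRUE: "/"-joined genus
matchGenera("Clostridium_sensu_stricto", "Clostridium")  # TRUE: "_"-suffixed genus
matchGenera("Bacteroides", "Prevotella")                 # FALSE: genera disagree
matchGenera(NA, "Bacteroides")                           # FALSE: missing genus
```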
gen.match = mapply(matchGenera, taxtab$tax[,gcol], s_final[,1])
taxtab$tax = cbind(taxtab$tax, s_final[,2])
colnames(taxtab$tax)[ncol(taxtab$tax)] = "Species"
print(paste(sum(!is.na(s_final[,2])), "out of",
nrow(s_final), "were assigned to the species level."))
## [1] "587 out of 7714 were assigned to the species level."
taxtab$tax[!gen.match,"Species"] = NA
print(paste("Of which", sum(!is.na(taxtab$tax[,"Species"])),
"had genera consistent with the input table."))
## [1] "Of which 476 had genera consistent with the input table."
Prepare a data.frame df from seqtab.nochim and taxtab.
df = data.frame(sequence = seqs, abundance = colSums(seqtab.nochim), stringsAsFactors = FALSE)
df$id = svid
df = merge(df, as.data.frame(taxtab), by = "row.names")
rownames(df) = df$id
df = df[order(df$id),2:ncol(df)]
Perform multiple sequence alignment of the unaligned sequences.
alignment = AlignSeqs(Biostrings::DNAStringSet(setNames(df$sequence, df$id)), anchor = NA)
## Determining distance matrix based on shared 8-mers:
## ==========================================================================================
##
## Time difference of 479.97 secs
##
## Clustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 7.04 secs
##
## Aligning Sequences:
## ==========================================================================================
##
## Time difference of 85.3 secs
##
## Iteration 1 of 2:
##
## Determining distance matrix based on alignment:
## ==========================================================================================
##
## Time difference of 76.33 secs
##
## Reclustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 5.85 secs
##
## Realigning Sequences:
## ==========================================================================================
##
## Time difference of 61.57 secs
##
## Iteration 2 of 2:
##
## Determining distance matrix based on alignment:
## ==========================================================================================
##
## Time difference of 66.86 secs
##
## Reclustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 5.87 secs
##
## Realigning Sequences:
## ==========================================================================================
##
## Time difference of 14.13 secs
##
## Refining the alignment:
## ==========================================================================================
##
## Time difference of 0.92 secs
Export alignment.
phang.align = phyDat(as(alignment, "matrix"), type = "DNA")
write.phyDat(phang.align, file = "alignment.fasta", format = "fasta")
write.phyDat(phang.align, file = "alignment.aln", format = "phylip")
Set the full paths to the RAxML and RAxML-NG binaries.
raxml = "/path-to-raxml-binary" # e.g. /usr/local/bin/raxmlHPC-PTHREADS-SSE3
raxmlng = "/path-to-raxml-ng-binary" # e.g. /usr/local/bin/raxml-ng
Run RAxML and RAxML-NG.
system2(raxml, args = c("-T 2", "-f E", "-p 1234", "-x 5678", "-m GTRCAT", "-N 1",
"-s alignment.aln", "-n raxml_tree_GTRCAT"))
system2(raxmlng, args = c("--evaluate", "--force", "--seed 1234", "--log progress", "--threads 2",
"--msa alignment.fasta", "--model GTR+G", "--brlen scaled",
"--tree RAxML_fastTree.raxml_tree_GTRCAT", "--prefix GTRCAT"))
Import the tree using the read_tree function from phyloseq.
raxml_tree = read_tree("GTRCAT.raxml.bestTree")
Load the sample metadata.
samdf = data.frame(fread("sample.meta", colClasses = "character"))
rownames(samdf) = samdf$Sample_ID
samdf$Sample_ID = as.factor(samdf$Sample_ID)
samdf$Sample_ID = factor(samdf$Sample_ID, levels = c(sort(levels(samdf$Sample_ID), decreasing = F)))
samdf$Subject = as.factor(samdf$Subject)
samdf$Group2 = paste(samdf$Group, samdf$Time, sep = "_")
samdf$Group = as.factor(samdf$Group)
samdf$Group2 = as.factor(samdf$Group2)
samdf$Time = as.factor(samdf$Time)
head(samdf)
## Run_ID Sample_ID Sample_title Subject Group Time Group2
## C0005B ERR2730149 C0005B C0005 baseline C0005 control baseline control_baseline
## C0009B ERR2730151 C0009B C0009 baseline C0009 control baseline control_baseline
## C0019B ERR2730154 C0019B C0019 baseline C0019 control baseline control_baseline
## C0020B ERR2730155 C0020B C0020 baseline C0020 control baseline control_baseline
## C0024B ERR2730158 C0024B C0024 baseline C0024 control baseline control_baseline
## C0032B ERR2730162 C0032B C0032 baseline C0032 control baseline control_baseline
Prepare new_seqtab and tax data.frames containing abundance and taxonomy information respectively.
new_seqtab = seqtab.nochim
colnames(new_seqtab) = df[match(colnames(new_seqtab), df$sequence),]$id
# Update rownames in new_seqtab from Run_ID to Sample_ID
rownames(new_seqtab) = as.character(samdf[match(rownames(new_seqtab), samdf$Run_ID),]$Sample_ID)
new_taxtab = taxtab
rownames(new_taxtab$tax) = df[match(rownames(new_taxtab$tax), df$sequence),]$id
tax = as.data.frame(new_taxtab$tax)
tax$Family = as.character(tax$Family)
tax$Genus = as.character(tax$Genus)
Construct a phyloseq object.
ps = phyloseq(tax_table(as.matrix(tax)),
sample_data(samdf),
otu_table(new_seqtab, taxa_are_rows = FALSE),
phy_tree(raxml_tree))
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 7714 taxa and 126 samples ]
## sample_data() Sample Data: [ 126 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 7714 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 7714 tips and 7712 internal nodes ]
save.image(file = "1_dada2_tutorial.RData")
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3_4.10.0/envs/r4-16S/lib/libopenblasp-r0.3.15.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods
## [9] base
##
## other attached packages:
## [1] ggplot2_3.3.4 phyloseq_1.36.0 phangorn_2.7.0 ape_5.5
## [5] DECIPHER_2.20.0 RSQLite_2.2.5 Biostrings_2.60.0 GenomeInfoDb_1.28.0
## [9] XVector_0.32.0 IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
## [13] data.table_1.14.0 dada2_1.20.0 Rcpp_1.0.6 knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 bitops_1.0-7 matrixStats_0.59.0
## [4] bit64_4.0.5 RColorBrewer_1.1-2 tools_4.1.0
## [7] vegan_2.5-7 utf8_1.2.1 R6_2.5.0
## [10] mgcv_1.8-36 DBI_1.1.1 colorspace_2.0-2
## [13] permute_0.9-5 rhdf5filters_1.4.0 ade4_1.7-17
## [16] withr_2.4.2 gridExtra_2.3 tidyselect_1.1.1
## [19] bit_4.0.4 compiler_4.1.0 Biobase_2.52.0
## [22] DelayedArray_0.18.0 labeling_0.4.2 scales_1.1.1
## [25] quadprog_1.5-8 stringr_1.4.0 digest_0.6.27
## [28] Rsamtools_2.8.0 rmarkdown_2.9 jpeg_0.1-8.1
## [31] pkgconfig_2.0.3 htmltools_0.5.1.1 MatrixGenerics_1.4.0
## [34] highr_0.9 fastmap_1.1.0 rlang_0.4.11
## [37] farver_2.1.0 generics_0.1.0 hwriter_1.3.2
## [40] jsonlite_1.7.2 BiocParallel_1.26.0 dplyr_1.0.7
## [43] RCurl_1.98-1.3 magrittr_2.0.1 GenomeInfoDbData_1.2.6
## [46] biomformat_1.20.0 Matrix_1.3-4 Rhdf5lib_1.14.0
## [49] munsell_0.5.0 fansi_0.4.2 lifecycle_1.0.0
## [52] stringi_1.6.2 yaml_2.2.1 MASS_7.3-54
## [55] SummarizedExperiment_1.22.0 zlibbioc_1.38.0 rhdf5_2.36.0
## [58] plyr_1.8.6 grid_4.1.0 blob_1.2.1
## [61] crayon_1.4.1 lattice_0.20-44 splines_4.1.0
## [64] multtest_2.48.0 pillar_1.6.1 igraph_1.2.6
## [67] GenomicRanges_1.44.0 reshape2_1.4.4 codetools_0.2-18
## [70] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
## [73] ShortRead_1.50.0 latticeExtra_0.6-29 RcppParallel_5.1.4
## [76] png_0.1-7 vctrs_0.3.8 foreach_1.5.1
## [79] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [82] cachem_1.0.5 xfun_0.24 survival_3.2-11
## [85] tibble_3.1.2 iterators_1.0.13 GenomicAlignments_1.28.0
## [88] memoise_2.0.0 cluster_2.1.2 ellipsis_0.3.2