16S-rDNA-V3-V4 repository: https://github.com/ycl6/16S-rDNA-V3-V4
cutadapt: GitHub, Documentation, Paper
DADA2: GitHub, Documentation, Paper
Demo Dataset: PRJEB27564 from Gut Microbiota in Parkinson’s Disease: Temporal Stability and Relations to Disease Progression. EBioMedicine. 2019;44:691-707
License: GPL-3.0
The PCR primer sequences can be removed using any primer or adapter trimming software. The cutadapt software offers options to specify primer/adapter sequences at both the 5’ and 3’ ends, and on the second read of each pair, making it an ideal choice for this task.
The parameters -a and -g specify adapters to be removed from each read. The -A and -G options work like their -a and -g counterparts, but are applied to the second read in each pair. So the trimming parameters would be like this:
cutadapt \
-g 5'-end-adapter-seq \
-a 3'-end-adapter-seq \
-G 5'-end-adapter-from-second-read-seq \
-A 3'-end-adapter-from-second-read-seq
For a 341F (5’-CCTACGGGNGGCWGCAG-3’) and 785R (5’-GACTACHVGGGTATCTAATCC-3’) primer set, the trimming parameters would be like this:
# -g: 341F; -a: 785R reverse complement; -G: 785R; -A: 341F reverse complement
cutadapt \
-g CCTACGGGNGGCWGCAG \
-a GGATTAGATACCCBDGTAGTC \
-G GACTACHVGGGTATCTAATCC \
-A CTGCWGCCNCCCGTAGG
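The reverse complements above can be derived programmatically. Below is a minimal base-R sketch; the `rc` helper is illustrative (the workflow later uses `dada2::rc` for the same purpose) and handles IUPAC wildcard codes such as N, W, B, D, H and V.

```r
# Reverse-complement a primer sequence, honouring IUPAC wildcard codes
# (illustrative helper; the pipeline itself uses dada2::rc)
rc = function(seq) {
    comp = chartr("ACGTNWSMKRYBDHV", "TGCANWSKMYRVHDB", seq)
    paste(rev(strsplit(comp, "")[[1]]), collapse = "")
}

rc("CCTACGGGNGGCWGCAG")       # 341F -> "CTGCWGCCNCCCGTAGG"
rc("GACTACHVGGGTATCTAATCC")   # 785R -> "GGATTAGATACCCBDGTAGTC"
```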
You can use the included Perl script run_trimming.pl to perform primer trimming. It determines the reverse complement of the primer sequences you provide and performs the trimming.
In the example below, the raw fastq files are placed in the raw folder and the trimmed fastq files will be written to the pre-defined folder named trimmed.
cd /ngs/16S-Demo
# Usage: perl run_trimming.pl project_folder fastq_folder forward_primer_seq reverse_primer_seq
perl run_trimming.pl PRJEB27564 raw CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC
Or follow the instructions below to run cutadapt within R.
cd /ngs/16S-Demo/PRJEB27564
R
library("dada2")
library("data.table")
library("DECIPHER")
library("phangorn")
library("phyloseq")
library("ggplot2")
Change the fastq path to correspond to the location of the raw fastq files.
fastq = "raw"         # raw fastq files
trimmed = "trimmed"   # cutadapt trimmed fastq files
filt = "filt"         # dada2 trimmed fastq files
outfiles = "outfiles" # output files
images = "images"     # output images
if(!dir.exists(filt)) dir.create(filt)
if(!dir.exists(outfiles)) dir.create(outfiles)
if(!dir.exists(images)) dir.create(images)
head(list.files(fastq),15)
## [1] "download.sh" "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz"
## [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730154_1.fastq.gz"
## [7] "ERR2730154_2.fastq.gz" "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz"
## [10] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730162_1.fastq.gz"
## [13] "ERR2730162_2.fastq.gz" "ERR2730163_1.fastq.gz" "ERR2730163_2.fastq.gz"
The trimming program cutadapt is called via the system2 function to perform the trimming.
Note: Skip this step if trimming has already been performed
fns = sort(list.files(fastq, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)]
fnRs = fns[grep("2.fastq.gz", fns)]
if(!dir.exists(trimmed)) dir.create(trimmed)
fnFs.cut = file.path(trimmed, basename(fnFs))
fnRs.cut = file.path(trimmed, basename(fnRs))
log.cut = gsub(".1.fastq.gz", ".log", fnFs.cut)
sample.names = gsub(".1.fastq.gz", "", basename(fnFs.cut))
# Define the primer set used to perform PCR
FWD = "CCTACGGGNGGCWGCAG"     # 341F
REV = "GACTACHVGGGTATCTAATCC" # 785R
# Get reverse complement DNA sequences
FWD.RC = dada2::rc(FWD)
REV.RC = dada2::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags = paste("-g", FWD, "-a", REV.RC)

# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags = paste("-G", REV, "-A", FWD.RC)
# Run cutadapt to remove primers
# Note to change the PATH to cutadapt accordingly
cutadapt = "/path-to-bin/cutadapt"
for(i in seq_along(fnFs)) {
print(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
system2(cutadapt,
stdout = log.cut[i], stderr = log.cut[i], # log file
args = c(R1.flags, R2.flags,
"-n 2", # -n 2 required to remove FWD and REV from reads
"--match-read-wildcards", # enable IUPAC nucleotide codes (wildcard characters)
"--length 300", # Truncate reads to 300 bp
"-m 150", # discard reads shorter than LEN (avoid length zero sequences)
"--overlap 10", # min overlap between read and adapter for an adapter to be found
"-j 0", # auto-detection of CPU cores, only available on Python 3
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # trimmed files
fnFs[i], fnRs[i]) # input files
    )
}
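Each sample's cutadapt summary is captured in its .log file (via the stdout/stderr arguments above). As a quick check, the pass rate can be pulled out of a log with base R; the summary-line format shown below ("Pairs written (passing filters)") is an assumption based on cutadapt's paired-end report and may differ between versions, so adjust the pattern as needed.

```r
# Extract the percentage of read pairs that passed trimming from a cutadapt
# summary line (format assumed; adjust the regex for your cutadapt version)
log_line = "Pairs written (passing filters):        58,342 (98.5%)"
pct = as.numeric(sub(".*\\(([0-9.]+)%\\).*", "\\1", log_line))
pct  # 98.5
```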
head(list.files(trimmed),15)
## [1] "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz" "ERR2730149.log"
## [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730151.log"
## [7] "ERR2730154_1.fastq.gz" "ERR2730154_2.fastq.gz" "ERR2730154.log"
## [10] "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz" "ERR2730155.log"
## [13] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730158.log"
fns = sort(list.files(trimmed, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)] # Update the grep pattern when necessary
fnRs = fns[grep("2.fastq.gz", fns)] # Update the grep pattern when necessary
sample.names = gsub(".1.fastq.gz", "", basename(fnFs)) # Update the gsub pattern when necessary
# Check objects
fnFs
## [1] "trimmed/ERR2730149_1.fastq.gz" "trimmed/ERR2730151_1.fastq.gz"
## [3] "trimmed/ERR2730154_1.fastq.gz" "trimmed/ERR2730155_1.fastq.gz"
## [5] "trimmed/ERR2730158_1.fastq.gz" "trimmed/ERR2730162_1.fastq.gz"
## [7] "trimmed/ERR2730163_1.fastq.gz" "trimmed/ERR2730165_1.fastq.gz"
## [9] "trimmed/ERR2730166_1.fastq.gz" "trimmed/ERR2730170_1.fastq.gz"
## [11] "trimmed/ERR2730171_1.fastq.gz" "trimmed/ERR2730174_1.fastq.gz"
## [13] "trimmed/ERR2730178_1.fastq.gz" "trimmed/ERR2730179_1.fastq.gz"
## [15] "trimmed/ERR2730180_1.fastq.gz" "trimmed/ERR2730181_1.fastq.gz"
## [17] "trimmed/ERR2730184_1.fastq.gz" "trimmed/ERR2730187_1.fastq.gz"
## [19] "trimmed/ERR2730189_1.fastq.gz" "trimmed/ERR2730190_1.fastq.gz"
## [21] "trimmed/ERR2730193_1.fastq.gz" "trimmed/ERR2730194_1.fastq.gz"
## [23] "trimmed/ERR2730196_1.fastq.gz" "trimmed/ERR2730197_1.fastq.gz"
## [25] "trimmed/ERR2730198_1.fastq.gz" "trimmed/ERR2730199_1.fastq.gz"
## [27] "trimmed/ERR2730203_1.fastq.gz" "trimmed/ERR2730204_1.fastq.gz"
## [29] "trimmed/ERR2730206_1.fastq.gz" "trimmed/ERR2730209_1.fastq.gz"
## [31] "trimmed/ERR2730211_1.fastq.gz" "trimmed/ERR2730213_1.fastq.gz"
## [33] "trimmed/ERR2730214_1.fastq.gz" "trimmed/ERR2730216_1.fastq.gz"
## [35] "trimmed/ERR2730217_1.fastq.gz" "trimmed/ERR2730219_1.fastq.gz"
## [37] "trimmed/ERR2730223_1.fastq.gz" "trimmed/ERR2730225_1.fastq.gz"
## [39] "trimmed/ERR2730232_1.fastq.gz" "trimmed/ERR2730233_1.fastq.gz"
## [41] "trimmed/ERR2730234_1.fastq.gz" "trimmed/ERR2730235_1.fastq.gz"
## [43] "trimmed/ERR2730236_1.fastq.gz" "trimmed/ERR2730237_1.fastq.gz"
## [45] "trimmed/ERR2730243_1.fastq.gz" "trimmed/ERR2730246_1.fastq.gz"
## [47] "trimmed/ERR2730247_1.fastq.gz" "trimmed/ERR2730248_1.fastq.gz"
## [49] "trimmed/ERR2730249_1.fastq.gz" "trimmed/ERR2730253_1.fastq.gz"
## [51] "trimmed/ERR2730256_1.fastq.gz" "trimmed/ERR2730258_1.fastq.gz"
## [53] "trimmed/ERR2730259_1.fastq.gz" "trimmed/ERR2730261_1.fastq.gz"
## [55] "trimmed/ERR2730265_1.fastq.gz" "trimmed/ERR2730267_1.fastq.gz"
## [57] "trimmed/ERR2730269_1.fastq.gz" "trimmed/ERR2730270_1.fastq.gz"
## [59] "trimmed/ERR2730272_1.fastq.gz" "trimmed/ERR2730274_1.fastq.gz"
## [61] "trimmed/ERR2730275_1.fastq.gz" "trimmed/ERR2730277_1.fastq.gz"
## [63] "trimmed/ERR2730279_1.fastq.gz" "trimmed/ERR2730282_1.fastq.gz"
## [65] "trimmed/ERR2730283_1.fastq.gz" "trimmed/ERR2730286_1.fastq.gz"
## [67] "trimmed/ERR2730290_1.fastq.gz" "trimmed/ERR2730291_1.fastq.gz"
## [69] "trimmed/ERR2730293_1.fastq.gz" "trimmed/ERR2730294_1.fastq.gz"
## [71] "trimmed/ERR2730298_1.fastq.gz" "trimmed/ERR2730299_1.fastq.gz"
## [73] "trimmed/ERR2730302_1.fastq.gz" "trimmed/ERR2730306_1.fastq.gz"
## [75] "trimmed/ERR2730307_1.fastq.gz" "trimmed/ERR2730308_1.fastq.gz"
## [77] "trimmed/ERR2730309_1.fastq.gz" "trimmed/ERR2730312_1.fastq.gz"
## [79] "trimmed/ERR2730315_1.fastq.gz" "trimmed/ERR2730317_1.fastq.gz"
## [81] "trimmed/ERR2730318_1.fastq.gz" "trimmed/ERR2730321_1.fastq.gz"
## [83] "trimmed/ERR2730322_1.fastq.gz" "trimmed/ERR2730324_1.fastq.gz"
## [85] "trimmed/ERR2730325_1.fastq.gz" "trimmed/ERR2730326_1.fastq.gz"
## [87] "trimmed/ERR2730327_1.fastq.gz" "trimmed/ERR2730331_1.fastq.gz"
## [89] "trimmed/ERR2730332_1.fastq.gz" "trimmed/ERR2730334_1.fastq.gz"
## [91] "trimmed/ERR2730337_1.fastq.gz" "trimmed/ERR2730339_1.fastq.gz"
## [93] "trimmed/ERR2730341_1.fastq.gz" "trimmed/ERR2730342_1.fastq.gz"
## [95] "trimmed/ERR2730344_1.fastq.gz" "trimmed/ERR2730345_1.fastq.gz"
## [97] "trimmed/ERR2730347_1.fastq.gz" "trimmed/ERR2730351_1.fastq.gz"
## [99] "trimmed/ERR2730353_1.fastq.gz" "trimmed/ERR2730360_1.fastq.gz"
## [101] "trimmed/ERR2730361_1.fastq.gz" "trimmed/ERR2730362_1.fastq.gz"
## [103] "trimmed/ERR2730363_1.fastq.gz" "trimmed/ERR2730364_1.fastq.gz"
## [105] "trimmed/ERR2730365_1.fastq.gz" "trimmed/ERR2730371_1.fastq.gz"
## [107] "trimmed/ERR2730374_1.fastq.gz" "trimmed/ERR2730375_1.fastq.gz"
## [109] "trimmed/ERR2730376_1.fastq.gz" "trimmed/ERR2730377_1.fastq.gz"
## [111] "trimmed/ERR2730381_1.fastq.gz" "trimmed/ERR2730384_1.fastq.gz"
## [113] "trimmed/ERR2730386_1.fastq.gz" "trimmed/ERR2730387_1.fastq.gz"
## [115] "trimmed/ERR2730389_1.fastq.gz" "trimmed/ERR2730393_1.fastq.gz"
## [117] "trimmed/ERR2730395_1.fastq.gz" "trimmed/ERR2730397_1.fastq.gz"
## [119] "trimmed/ERR2730398_1.fastq.gz" "trimmed/ERR2730400_1.fastq.gz"
## [121] "trimmed/ERR2730402_1.fastq.gz" "trimmed/ERR2730403_1.fastq.gz"
## [123] "trimmed/ERR3040013_1.fastq.gz" "trimmed/ERR3040014_1.fastq.gz"
## [125] "trimmed/ERR3040017_1.fastq.gz" "trimmed/ERR3040018_1.fastq.gz"
fnRs
## [1] "trimmed/ERR2730149_2.fastq.gz" "trimmed/ERR2730151_2.fastq.gz"
## [3] "trimmed/ERR2730154_2.fastq.gz" "trimmed/ERR2730155_2.fastq.gz"
## [5] "trimmed/ERR2730158_2.fastq.gz" "trimmed/ERR2730162_2.fastq.gz"
## [7] "trimmed/ERR2730163_2.fastq.gz" "trimmed/ERR2730165_2.fastq.gz"
## [9] "trimmed/ERR2730166_2.fastq.gz" "trimmed/ERR2730170_2.fastq.gz"
## [11] "trimmed/ERR2730171_2.fastq.gz" "trimmed/ERR2730174_2.fastq.gz"
## [13] "trimmed/ERR2730178_2.fastq.gz" "trimmed/ERR2730179_2.fastq.gz"
## [15] "trimmed/ERR2730180_2.fastq.gz" "trimmed/ERR2730181_2.fastq.gz"
## [17] "trimmed/ERR2730184_2.fastq.gz" "trimmed/ERR2730187_2.fastq.gz"
## [19] "trimmed/ERR2730189_2.fastq.gz" "trimmed/ERR2730190_2.fastq.gz"
## [21] "trimmed/ERR2730193_2.fastq.gz" "trimmed/ERR2730194_2.fastq.gz"
## [23] "trimmed/ERR2730196_2.fastq.gz" "trimmed/ERR2730197_2.fastq.gz"
## [25] "trimmed/ERR2730198_2.fastq.gz" "trimmed/ERR2730199_2.fastq.gz"
## [27] "trimmed/ERR2730203_2.fastq.gz" "trimmed/ERR2730204_2.fastq.gz"
## [29] "trimmed/ERR2730206_2.fastq.gz" "trimmed/ERR2730209_2.fastq.gz"
## [31] "trimmed/ERR2730211_2.fastq.gz" "trimmed/ERR2730213_2.fastq.gz"
## [33] "trimmed/ERR2730214_2.fastq.gz" "trimmed/ERR2730216_2.fastq.gz"
## [35] "trimmed/ERR2730217_2.fastq.gz" "trimmed/ERR2730219_2.fastq.gz"
## [37] "trimmed/ERR2730223_2.fastq.gz" "trimmed/ERR2730225_2.fastq.gz"
## [39] "trimmed/ERR2730232_2.fastq.gz" "trimmed/ERR2730233_2.fastq.gz"
## [41] "trimmed/ERR2730234_2.fastq.gz" "trimmed/ERR2730235_2.fastq.gz"
## [43] "trimmed/ERR2730236_2.fastq.gz" "trimmed/ERR2730237_2.fastq.gz"
## [45] "trimmed/ERR2730243_2.fastq.gz" "trimmed/ERR2730246_2.fastq.gz"
## [47] "trimmed/ERR2730247_2.fastq.gz" "trimmed/ERR2730248_2.fastq.gz"
## [49] "trimmed/ERR2730249_2.fastq.gz" "trimmed/ERR2730253_2.fastq.gz"
## [51] "trimmed/ERR2730256_2.fastq.gz" "trimmed/ERR2730258_2.fastq.gz"
## [53] "trimmed/ERR2730259_2.fastq.gz" "trimmed/ERR2730261_2.fastq.gz"
## [55] "trimmed/ERR2730265_2.fastq.gz" "trimmed/ERR2730267_2.fastq.gz"
## [57] "trimmed/ERR2730269_2.fastq.gz" "trimmed/ERR2730270_2.fastq.gz"
## [59] "trimmed/ERR2730272_2.fastq.gz" "trimmed/ERR2730274_2.fastq.gz"
## [61] "trimmed/ERR2730275_2.fastq.gz" "trimmed/ERR2730277_2.fastq.gz"
## [63] "trimmed/ERR2730279_2.fastq.gz" "trimmed/ERR2730282_2.fastq.gz"
## [65] "trimmed/ERR2730283_2.fastq.gz" "trimmed/ERR2730286_2.fastq.gz"
## [67] "trimmed/ERR2730290_2.fastq.gz" "trimmed/ERR2730291_2.fastq.gz"
## [69] "trimmed/ERR2730293_2.fastq.gz" "trimmed/ERR2730294_2.fastq.gz"
## [71] "trimmed/ERR2730298_2.fastq.gz" "trimmed/ERR2730299_2.fastq.gz"
## [73] "trimmed/ERR2730302_2.fastq.gz" "trimmed/ERR2730306_2.fastq.gz"
## [75] "trimmed/ERR2730307_2.fastq.gz" "trimmed/ERR2730308_2.fastq.gz"
## [77] "trimmed/ERR2730309_2.fastq.gz" "trimmed/ERR2730312_2.fastq.gz"
## [79] "trimmed/ERR2730315_2.fastq.gz" "trimmed/ERR2730317_2.fastq.gz"
## [81] "trimmed/ERR2730318_2.fastq.gz" "trimmed/ERR2730321_2.fastq.gz"
## [83] "trimmed/ERR2730322_2.fastq.gz" "trimmed/ERR2730324_2.fastq.gz"
## [85] "trimmed/ERR2730325_2.fastq.gz" "trimmed/ERR2730326_2.fastq.gz"
## [87] "trimmed/ERR2730327_2.fastq.gz" "trimmed/ERR2730331_2.fastq.gz"
## [89] "trimmed/ERR2730332_2.fastq.gz" "trimmed/ERR2730334_2.fastq.gz"
## [91] "trimmed/ERR2730337_2.fastq.gz" "trimmed/ERR2730339_2.fastq.gz"
## [93] "trimmed/ERR2730341_2.fastq.gz" "trimmed/ERR2730342_2.fastq.gz"
## [95] "trimmed/ERR2730344_2.fastq.gz" "trimmed/ERR2730345_2.fastq.gz"
## [97] "trimmed/ERR2730347_2.fastq.gz" "trimmed/ERR2730351_2.fastq.gz"
## [99] "trimmed/ERR2730353_2.fastq.gz" "trimmed/ERR2730360_2.fastq.gz"
## [101] "trimmed/ERR2730361_2.fastq.gz" "trimmed/ERR2730362_2.fastq.gz"
## [103] "trimmed/ERR2730363_2.fastq.gz" "trimmed/ERR2730364_2.fastq.gz"
## [105] "trimmed/ERR2730365_2.fastq.gz" "trimmed/ERR2730371_2.fastq.gz"
## [107] "trimmed/ERR2730374_2.fastq.gz" "trimmed/ERR2730375_2.fastq.gz"
## [109] "trimmed/ERR2730376_2.fastq.gz" "trimmed/ERR2730377_2.fastq.gz"
## [111] "trimmed/ERR2730381_2.fastq.gz" "trimmed/ERR2730384_2.fastq.gz"
## [113] "trimmed/ERR2730386_2.fastq.gz" "trimmed/ERR2730387_2.fastq.gz"
## [115] "trimmed/ERR2730389_2.fastq.gz" "trimmed/ERR2730393_2.fastq.gz"
## [117] "trimmed/ERR2730395_2.fastq.gz" "trimmed/ERR2730397_2.fastq.gz"
## [119] "trimmed/ERR2730398_2.fastq.gz" "trimmed/ERR2730400_2.fastq.gz"
## [121] "trimmed/ERR2730402_2.fastq.gz" "trimmed/ERR2730403_2.fastq.gz"
## [123] "trimmed/ERR3040013_2.fastq.gz" "trimmed/ERR3040014_2.fastq.gz"
## [125] "trimmed/ERR3040017_2.fastq.gz" "trimmed/ERR3040018_2.fastq.gz"
sample.names
## [1] "ERR2730149" "ERR2730151" "ERR2730154" "ERR2730155" "ERR2730158" "ERR2730162"
## [7] "ERR2730163" "ERR2730165" "ERR2730166" "ERR2730170" "ERR2730171" "ERR2730174"
## [13] "ERR2730178" "ERR2730179" "ERR2730180" "ERR2730181" "ERR2730184" "ERR2730187"
## [19] "ERR2730189" "ERR2730190" "ERR2730193" "ERR2730194" "ERR2730196" "ERR2730197"
## [25] "ERR2730198" "ERR2730199" "ERR2730203" "ERR2730204" "ERR2730206" "ERR2730209"
## [31] "ERR2730211" "ERR2730213" "ERR2730214" "ERR2730216" "ERR2730217" "ERR2730219"
## [37] "ERR2730223" "ERR2730225" "ERR2730232" "ERR2730233" "ERR2730234" "ERR2730235"
## [43] "ERR2730236" "ERR2730237" "ERR2730243" "ERR2730246" "ERR2730247" "ERR2730248"
## [49] "ERR2730249" "ERR2730253" "ERR2730256" "ERR2730258" "ERR2730259" "ERR2730261"
## [55] "ERR2730265" "ERR2730267" "ERR2730269" "ERR2730270" "ERR2730272" "ERR2730274"
## [61] "ERR2730275" "ERR2730277" "ERR2730279" "ERR2730282" "ERR2730283" "ERR2730286"
## [67] "ERR2730290" "ERR2730291" "ERR2730293" "ERR2730294" "ERR2730298" "ERR2730299"
## [73] "ERR2730302" "ERR2730306" "ERR2730307" "ERR2730308" "ERR2730309" "ERR2730312"
## [79] "ERR2730315" "ERR2730317" "ERR2730318" "ERR2730321" "ERR2730322" "ERR2730324"
## [85] "ERR2730325" "ERR2730326" "ERR2730327" "ERR2730331" "ERR2730332" "ERR2730334"
## [91] "ERR2730337" "ERR2730339" "ERR2730341" "ERR2730342" "ERR2730344" "ERR2730345"
## [97] "ERR2730347" "ERR2730351" "ERR2730353" "ERR2730360" "ERR2730361" "ERR2730362"
## [103] "ERR2730363" "ERR2730364" "ERR2730365" "ERR2730371" "ERR2730374" "ERR2730375"
## [109] "ERR2730376" "ERR2730377" "ERR2730381" "ERR2730384" "ERR2730386" "ERR2730387"
## [115] "ERR2730389" "ERR2730393" "ERR2730395" "ERR2730397" "ERR2730398" "ERR2730400"
## [121] "ERR2730402" "ERR2730403" "ERR3040013" "ERR3040014" "ERR3040017" "ERR3040018"
We use the plotQualityProfile function to plot the quality profiles of the trimmed fastq files.
# Plot quality profile of fastq files
ii = 1:length(sample.names)

pdf(paste0(images, "/plotQualityProfile.pdf"), width = 8, height = 8, pointsize = 12)
for(i in ii) {
    message(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
    print(plotQualityProfile(fnFs[i]) + ggtitle("Fwd"))
    print(plotQualityProfile(fnRs[i]) + ggtitle("Rev"))
}
invisible(dev.off())
Reviewing the quality profiles, the quality distribution of the forward reads appears to drop after position 260, and that of the reverse reads after position 200. We will use these positions to truncate the forward and reverse reads respectively in the next step.
Remember to review plotQualityProfile.pdf to select the best parameters for the truncLen argument.
# Set paths to the dada2-filtered files
filtFs = file.path(filt, basename(fnFs))
filtRs = file.path(filt, basename(fnRs))

# Perform filtering and trimming
# Need to keep parameters consistent between runs of the same study
out = filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                    truncLen = c(260,200), minLen = 200, maxN = 0, truncQ = 2, maxEE = c(2,5),
                    rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)

out = as.data.frame(out)
rownames(out) = sample.names
head(out, 10)
## reads.in reads.out
## ERR2730149 65747 58829
## ERR2730151 70743 61547
## ERR2730154 91558 81374
## ERR2730155 58955 53269
## ERR2730158 128246 116847
## ERR2730162 134565 121999
## ERR2730163 116621 105093
## ERR2730165 57401 51513
## ERR2730166 121037 108838
## ERR2730170 57359 51276
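The maxEE thresholds above filter on a read's expected errors, EE = sum(10^(-Q/10)) over its per-base quality scores, rather than on average quality. A small sketch of the arithmetic (the uniform-Q30 read is a made-up example):

```r
# Expected errors for a hypothetical 260-bp read with uniform Q30 bases:
# each base contributes 10^(-30/10) = 0.001 expected errors
qual = rep(30L, 260)
ee = sum(10^(-qual/10))
ee  # 0.26, comfortably under the forward-read threshold maxEE = 2
```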
Use the learnErrors function to perform dereplication and learn the error rates. The derepFastq function used in past workflows has been integrated into the learnErrors function.
errF = learnErrors(filtFs, multithread = TRUE)
## 128404900 total bases in 493865 reads from 6 samples will be used for learning the error rates.
errR = learnErrors(filtRs, multithread = TRUE)
## 119791600 total bases in 598958 reads from 7 samples will be used for learning the error rates.
# Visualize the estimated error rates
pdf(paste0(images, "/plotErrors.pdf"), width = 10, height = 10, pointsize = 12)
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
invisible(dev.off())
Forward reads
Reverse reads
We now apply the core sample inference algorithm to the filtered and trimmed sequence data.
Note: By default, the dada function processes each sample independently (i.e. pool = FALSE). If your samples come from an extremely diverse community (e.g. soil), pooling (pool = TRUE) or pseudo-pooling (recommended; pool = "pseudo") might help to identify the rare ASVs in each sample
dadaFs = dada(filtFs, err = errF, pool = FALSE, multithread = TRUE)
dadaRs = dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
We now merge the forward and reverse reads together to obtain the full denoised sequences.
mergers = mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)
We now construct an amplicon sequence variant (ASV) table.
seqtab = makeSequenceTable(mergers)
View the length frequency distribution with table.
table(nchar(getSequences(seqtab)))
##
## 260 261 264 267 270 272 273 275 277 278 279 281 282 284 286
## 461 7 1 1 7 3 7 2 10 7 11 14 1 1 1
## 288 300 301 313 317 320 325 335 341 365 384 385 386 400 401
## 1 2 1 1 2 1 2 1 1 1 4 2 3 2 724
## 402 403 404 405 406 407 408 409 410 411 412 413 417 418 419
## 11302 2584 4587 4221 126 257 112 124 34 264 13 1 2 8 101
## 420 421 422 423 424 425 426 427 428 429 430 431 432 433 438
## 67 531 3606 326 82 68 206 1082 452 7 12 1 8 3 1
## 439 440 441 442 443 444 445 446 447 448
## 24 7 13 10 9 10 4 6 5 7
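If the table reveals ASVs far outside the expected V3–V4 amplicon range, they can be dropped before chimera removal by subsetting the sequence table on sequence length. This is an optional step not performed in the workflow above; the 400–430 bp window is an illustrative choice, and the toy matrix below stands in for a real seqtab.

```r
# Keep only ASVs whose sequence length falls inside an expected window
# (toy sequence table: column names are the ASV sequences, as in dada2)
seqs = c(strrep("A", 405), strrep("C", 260))
seqtab_toy = matrix(c(10, 5), nrow = 1, dimnames = list("S1", seqs))

keep = nchar(colnames(seqtab_toy)) >= 400 & nchar(colnames(seqtab_toy)) <= 430
seqtab_filt = seqtab_toy[, keep, drop = FALSE]
ncol(seqtab_filt)  # 1 (the 260-bp off-target ASV is removed)
```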
Save sequence table.
saveRDS(seqtab, "seqtab.rds")
# Or as an example, save only the first 5 samples
# saveRDS(seqtab[c(1:5),], "seqtab.rds")
This section elaborates a little on merging results from multiple sequencing runs. You can merge sequence tables from multiple runs belonging to the same experiment or project before moving on to chimera removal and taxonomy assignment. In the example snippet below, we use mergeSequenceTables to combine two sequence tables.
Note: Skip this step if you do not require merging multiple sequence tables
# Load sequence tables from multiple runs
seqtab1 = readRDS("/some-path/run1/seqtab.rds")
seqtab2 = readRDS("/some-path/run2/seqtab.rds")
# Check the dimensions and sample names
dim(seqtab1)
rownames(seqtab1)
dim(seqtab2)
rownames(seqtab2)
# Merge sequence table and remove chimeras
seqtab = mergeSequenceTables(seqtab1, seqtab2)
# To sum values of same sample from multiple sequence table (i.e. when a sample was re-sequenced due to low depth)
# seqtab = mergeSequenceTables(seqtab1, seqtab2, repeats = "sum")
# Check the dimension of the merged sequence table
dim(seqtab)
After obtaining the merged sequence table, you can continue with the below data processing steps.
seqtab.nochim = removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE,
                                   verbose = TRUE)
## Identified 23851 bimeras out of 31565 input sequences.
# Update the sample names to exclude "fastq.gz" in name
rownames(seqtab.nochim) = sample.names
View the dimensions of seqtab and seqtab.nochim.
dim(seqtab)
## [1] 126 31565
dim(seqtab.nochim)
## [1] 126 7714
Print the proportion of non-chimeras in merged sequence reads.
sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9683915
getN <- function(x) sum(getUniques(x))

track = cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN),
              sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) = c("Trimmed", "Filtered", "denoisedF", "denoisedR", "merged", "nonchim")
track = cbind(data.frame(SampleID = sample.names), track)
write.table(track, file = paste0(outfiles, "/track.txt"), sep = "\t", quote = F,
row.names = F, col.names = T)
Set the full paths to the Silva and NCBI databases.
dbpath = "/path-to-db/"
ref1 = paste0(dbpath, "silva_nr99_v138.1_train_set.fa.gz")
ref2 = paste0(dbpath, "silva_species_assignment_v138.1.fa.gz")
ref3 = paste0(dbpath, "16SMicrobial.fa.gz")
Use the assignTaxonomy function to classify sequences against the SILVA reference training dataset ref1, and use the assignSpecies function to perform taxonomic assignment to the species level by exact matching against the SILVA (ref2) and NCBI (ref3) reference datasets.
# Extract the sequences
seqs = getSequences(seqtab.nochim)

set.seed(12345)
taxtab = assignTaxonomy(seqs, refFasta = ref1, minBoot = 80, tryRC = TRUE, verbose = TRUE,
                        outputBootstraps = TRUE, multithread = TRUE)
## Finished processing reference fasta.
spec_silva = assignSpecies(seqs, ref2, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 473 out of 7714 were assigned to the species level.
spec_ncbi = assignSpecies(seqs, ref3, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 355 out of 7714 were assigned to the species level.
Combine the species-level taxonomic assignments from the two reference sources.
SVformat = paste("%0", nchar(as.character(ncol(seqtab.nochim))), "d", sep = "")
svid = paste0("SV", sprintf(SVformat, seq(ncol(seqtab.nochim))))
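These two lines build zero-padded, fixed-width ASV labels so that the labels sort correctly as plain strings. With the 7714 ASVs in this dataset, the width is nchar("7714") = 4:

```r
# With 7714 ASVs the width is nchar("7714") = 4, so the format is "%04d"
n = 7714
SVformat = paste("%0", nchar(as.character(n)), "d", sep = "")
svid = paste0("SV", sprintf(SVformat, seq(n)))
head(svid, 3)  # "SV0001" "SV0002" "SV0003"
tail(svid, 1)  # "SV7714"
```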
s_silva = as.data.frame(spec_silva, stringsAsFactors = FALSE)
rownames(s_silva) = svid

s_ncbi = as.data.frame(spec_ncbi, stringsAsFactors = FALSE)
rownames(s_ncbi) = svid
s_ncbi$Genus = gsub("\\[|\\]", "", s_ncbi$Genus)

s_merged = cbind(s_ncbi, s_silva)
colnames(s_merged) = c("nGenus","nSpecies","sGenus","sSpecies")

s_merged1 = s_merged[!is.na(s_merged$nSpecies),]
colnames(s_merged1)[1:2] = c("Genus","Species")

s_merged2 = s_merged[is.na(s_merged$nSpecies) & !is.na(s_merged$sSpecies),]
colnames(s_merged2)[3:4] = c("Genus","Species")

s_merged3 = s_merged[is.na(s_merged$nSpecies) & is.na(s_merged$sSpecies),]
colnames(s_merged3)[3:4] = c("Genus","Species")

s_final = rbind(s_merged1[,c("Genus","Species")], s_merged2[,c("Genus","Species")],
                s_merged3[,c("Genus","Species")])
s_final = s_final[order(row.names(s_final)),]
s_final = as.matrix(s_final)
if("Genus" %in% colnames(taxtab$tax)) {
    gcol = which(colnames(taxtab$tax) == "Genus")
} else {
    gcol = ncol(taxtab$tax)
}
matchGenera <- function(gen.tax, gen.binom, split.glyph = "/") {
    if(is.na(gen.tax) || is.na(gen.binom)) { return(FALSE) }
    if((gen.tax == gen.binom) ||
       grepl(paste0("^", gen.binom, "[ _", split.glyph, "]"), gen.tax) ||
       grepl(paste0(split.glyph, gen.binom, "$"), gen.tax)) {
        return(TRUE)
    } else {
        return(FALSE)
    }
}
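A quick sanity check of the genus-matching logic (the block below repeats the matchGenera definition in compact form so it can run standalone; the genus names are made-up examples):

```r
# Standalone copy of the genus-matching helper for illustration
matchGenera <- function(gen.tax, gen.binom, split.glyph = "/") {
    if(is.na(gen.tax) || is.na(gen.binom)) return(FALSE)
    (gen.tax == gen.binom) ||
        grepl(paste0("^", gen.binom, "[ _", split.glyph, "]"), gen.tax) ||
        grepl(paste0(split.glyph, gen.binom, "$"), gen.tax)
}

matchGenera("Escherichia/Shigella", "Escherichia")  # TRUE: glyph-split genus
matchGenera("Bacteroides", "Prevotella")            # FALSE: different genera
matchGenera(NA, "Prevotella")                       # FALSE: missing assignment
```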
gen.match = mapply(matchGenera, taxtab$tax[,gcol], s_final[,1])

taxtab$tax = cbind(taxtab$tax, s_final[,2])
colnames(taxtab$tax)[ncol(taxtab$tax)] = "Species"

print(paste(sum(!is.na(s_final[,2])), "out of",
            nrow(s_final), "were assigned to the species level."))
## [1] "587 out of 7714 were assigned to the species level."
taxtab$tax[!gen.match,"Species"] = NA

print(paste("Of which", sum(!is.na(taxtab$tax[,"Species"])),
            "had genera consistent with the input table."))
## [1] "Of which 476 had genera consistent with the input table."
Prepare a data.frame df from seqtab.nochim and taxtab.
df = data.frame(sequence = seqs, abundance = colSums(seqtab.nochim), stringsAsFactors = FALSE)
df$id = svid
df = merge(df, as.data.frame(taxtab), by = "row.names")
rownames(df) = df$id
df = df[order(df$id), 2:ncol(df)]
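One subtlety here: merge(..., by = "row.names") returns the shared row names in a new first column called Row.names, which is why the final subsetting keeps only columns 2:ncol(df). A toy illustration with hypothetical SV labels:

```r
# merge by row names adds a "Row.names" column in position 1
a = data.frame(abundance = c(10, 20), row.names = c("SV0001", "SV0002"))
b = data.frame(Genus = c("Prevotella", "Bacteroides"), row.names = c("SV0001", "SV0002"))
m = merge(a, b, by = "row.names")

colnames(m)         # "Row.names" "abundance" "Genus"
m = m[, 2:ncol(m)]  # drop the helper column, as in the workflow above
```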
Perform alignment of multiple unaligned sequences.
alignment = AlignSeqs(Biostrings::DNAStringSet(setNames(df$sequence, df$id)), anchor = NA)
## Determining distance matrix based on shared 8-mers:
## ==========================================================================================
##
## Time difference of 479.97 secs
##
## Clustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 7.04 secs
##
## Aligning Sequences:
## ==========================================================================================
##
## Time difference of 85.3 secs
##
## Iteration 1 of 2:
##
## Determining distance matrix based on alignment:
## ==========================================================================================
##
## Time difference of 76.33 secs
##
## Reclustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 5.85 secs
##
## Realigning Sequences:
## ==========================================================================================
##
## Time difference of 61.57 secs
##
## Iteration 2 of 2:
##
## Determining distance matrix based on alignment:
## ==========================================================================================
##
## Time difference of 66.86 secs
##
## Reclustering into groups by similarity:
## ==========================================================================================
##
## Time difference of 5.87 secs
##
## Realigning Sequences:
## ==========================================================================================
##
## Time difference of 14.13 secs
##
## Refining the alignment:
## ==========================================================================================
##
## Time difference of 0.92 secs
Export alignment.
phang.align = phyDat(as(alignment, "matrix"), type = "DNA")
write.phyDat(phang.align, file = "alignment.fasta", format = "fasta")
write.phyDat(phang.align, file = "alignment.aln", format = "phylip")
Set the full paths to the RAxML and RAxML-NG binaries.
raxml = "/path-to-raxml-binary"      # e.g. /usr/local/bin/raxmlHPC-PTHREADS-SSE3
raxmlng = "/path-to-raxml-ng-binary" # e.g. /usr/local/bin/raxml-ng
Run RAxML and RAxML-NG.
system2(raxml, args = c("-T 2", "-f E", "-p 1234", "-x 5678", "-m GTRCAT", "-N 1",
"-s alignment.aln", "-n raxml_tree_GTRCAT"))
system2(raxmlng, args = c("--evaluate", "--force", "--seed 1234", "--log progress", "--threads 2",
"--msa alignment.fasta", "--model GTR+G", "--brlen scaled",
"--tree RAxML_fastTree.raxml_tree_GTRCAT", "--prefix GTRCAT"))
Import the tree using the read_tree function from phyloseq.
raxml_tree = read_tree("GTRCAT.raxml.bestTree")
samdf = data.frame(fread("sample.meta", colClasses = "character"))
rownames(samdf) = samdf$Sample_ID
samdf$Sample_ID = as.factor(samdf$Sample_ID)
samdf$Sample_ID = factor(samdf$Sample_ID, levels = c(sort(levels(samdf$Sample_ID), decreasing = F)))
samdf$Subject = as.factor(samdf$Subject)
samdf$Group2 = paste(samdf$Group, samdf$Time, sep = "_")
samdf$Group = as.factor(samdf$Group)
samdf$Group2 = as.factor(samdf$Group2)
samdf$Time = as.factor(samdf$Time)
head(samdf)
##            Run_ID Sample_ID   Sample_title Subject   Group     Time           Group2
## C0005B ERR2730149    C0005B C0005 baseline   C0005 control baseline control_baseline
## C0009B ERR2730151    C0009B C0009 baseline   C0009 control baseline control_baseline
## C0019B ERR2730154    C0019B C0019 baseline   C0019 control baseline control_baseline
## C0020B ERR2730155    C0020B C0020 baseline   C0020 control baseline control_baseline
## C0024B ERR2730158    C0024B C0024 baseline   C0024 control baseline control_baseline
## C0032B ERR2730162    C0032B C0032 baseline   C0032 control baseline control_baseline
phyloseq
Prepare new_seqtab and tax data.frames containing abundance and taxonomy information respectively.
new_seqtab = seqtab.nochim
colnames(new_seqtab) = df[match(colnames(new_seqtab), df$sequence),]$id

# Update rownames in new_seqtab from Run_ID to Sample_ID
rownames(new_seqtab) = as.character(samdf[match(rownames(new_seqtab), samdf$Run_ID),]$Sample_ID)

new_taxtab = taxtab
rownames(new_taxtab$tax) = df[match(rownames(new_taxtab$tax), df$sequence),]$id

tax = as.data.frame(new_taxtab$tax)
tax$Family = as.character(tax$Family)
tax$Genus = as.character(tax$Genus)
Construct a phyloseq object.
ps = phyloseq(tax_table(as.matrix(tax)),
              sample_data(samdf),
              otu_table(new_seqtab, taxa_are_rows = FALSE),
              phy_tree(raxml_tree))
ps
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 7714 taxa and 126 samples ]
## sample_data() Sample Data: [ 126 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 7714 taxa by 7 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 7714 tips and 7712 internal nodes ]
save.image(file = "1_dada2_tutorial.RData")
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3_4.10.0/envs/r4-16S/lib/libopenblasp-r0.3.15.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C LC_TIME=en_GB.UTF-8
## [4] LC_COLLATE=en_GB.UTF-8 LC_MONETARY=en_GB.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=en_GB.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets methods
## [9] base
##
## other attached packages:
## [1] ggplot2_3.3.4 phyloseq_1.36.0 phangorn_2.7.0 ape_5.5
## [5] DECIPHER_2.20.0 RSQLite_2.2.5 Biostrings_2.60.0 GenomeInfoDb_1.28.0
## [9] XVector_0.32.0 IRanges_2.26.0 S4Vectors_0.30.0 BiocGenerics_0.38.0
## [13] data.table_1.14.0 dada2_1.20.0 Rcpp_1.0.6 knitr_1.33
##
## loaded via a namespace (and not attached):
## [1] nlme_3.1-152 bitops_1.0-7 matrixStats_0.59.0
## [4] bit64_4.0.5 RColorBrewer_1.1-2 tools_4.1.0
## [7] vegan_2.5-7 utf8_1.2.1 R6_2.5.0
## [10] mgcv_1.8-36 DBI_1.1.1 colorspace_2.0-2
## [13] permute_0.9-5 rhdf5filters_1.4.0 ade4_1.7-17
## [16] withr_2.4.2 gridExtra_2.3 tidyselect_1.1.1
## [19] bit_4.0.4 compiler_4.1.0 Biobase_2.52.0
## [22] DelayedArray_0.18.0 labeling_0.4.2 scales_1.1.1
## [25] quadprog_1.5-8 stringr_1.4.0 digest_0.6.27
## [28] Rsamtools_2.8.0 rmarkdown_2.9 jpeg_0.1-8.1
## [31] pkgconfig_2.0.3 htmltools_0.5.1.1 MatrixGenerics_1.4.0
## [34] highr_0.9 fastmap_1.1.0 rlang_0.4.11
## [37] farver_2.1.0 generics_0.1.0 hwriter_1.3.2
## [40] jsonlite_1.7.2 BiocParallel_1.26.0 dplyr_1.0.7
## [43] RCurl_1.98-1.3 magrittr_2.0.1 GenomeInfoDbData_1.2.6
## [46] biomformat_1.20.0 Matrix_1.3-4 Rhdf5lib_1.14.0
## [49] munsell_0.5.0 fansi_0.4.2 lifecycle_1.0.0
## [52] stringi_1.6.2 yaml_2.2.1 MASS_7.3-54
## [55] SummarizedExperiment_1.22.0 zlibbioc_1.38.0 rhdf5_2.36.0
## [58] plyr_1.8.6 grid_4.1.0 blob_1.2.1
## [61] crayon_1.4.1 lattice_0.20-44 splines_4.1.0
## [64] multtest_2.48.0 pillar_1.6.1 igraph_1.2.6
## [67] GenomicRanges_1.44.0 reshape2_1.4.4 codetools_0.2-18
## [70] fastmatch_1.1-0 glue_1.4.2 evaluate_0.14
## [73] ShortRead_1.50.0 latticeExtra_0.6-29 RcppParallel_5.1.4
## [76] png_0.1-7 vctrs_0.3.8 foreach_1.5.1
## [79] gtable_0.3.0 purrr_0.3.4 assertthat_0.2.1
## [82] cachem_1.0.5 xfun_0.24 survival_3.2-11
## [85] tibble_3.1.2 iterators_1.0.13 GenomicAlignments_1.28.0
## [88] memoise_2.0.0 cluster_2.1.2 ellipsis_0.3.2