16S-rDNA-V3-V4 repository: https://github.com/ycl6/16S-rDNA-V3-V4

cutadapt: GitHub, Documentation, Paper

DADA2: GitHub, Documentation, Paper

Demo Dataset: PRJEB27564 from Gut Microbiota in Parkinson’s Disease: Temporal Stability and Relations to Disease Progression. EBioMedicine. 2019;44:691-707

License: GPL-3.0


Primer trimming

The PCR primer sequences can be removed with any primer or adapter trimming software. cutadapt offers options to specify primer/adapter sequences at both the 5’ and 3’ ends of a read, and separately for the second read of a pair, making it an ideal choice for this task.

The -g and -a parameters specify the 5’ and 3’ adapters to be removed from each read. The -G and -A options work like their -g and -a counterparts, but are applied to the second read of each pair. A generic trimming command therefore looks like this:

cutadapt \
    -g 5'-end-adapter-seq \
    -a 3'-end-adapter-seq \
    -G 5'-end-adapter-from-second-read-seq \
    -A 3'-end-adapter-from-second-read-seq

For a 341F (5’-CCTACGGGNGGCWGCAG-3’) and 785R (5’-GACTACHVGGGTATCTAATCC-3’) primer set, the trimming parameters would be like this:

cutadapt \
    -g CCTACGGGNGGCWGCAG \      # 341F seq
    -a GGATTAGATACCCBDGTAGTC \  # 785R reverse complement seq
    -G GACTACHVGGGTATCTAATCC \  # 785R seq
    -A CTGCWGCCNCCCGTAGG        # 341F reverse complement seq

You can use the included Perl script run_trimming.pl to perform primer trimming. It determines the reverse complements of the primer sequences you provide and performs the trimming.

In the example below, the raw fastq files are located in the raw folder and the trimmed fastq files will be written to the pre-defined folder named trimmed.

cd /ngs/16S-Demo

# Usage: perl run_trimming.pl project_folder fastq_folder forward_primer_seq reverse_primer_seq
perl run_trimming.pl PRJEB27564 raw CCTACGGGNGGCWGCAG GACTACHVGGGTATCTAATCC

Alternatively, follow the instructions below to run cutadapt from within R.

Start R

cd /ngs/16S-Demo/PRJEB27564

R

Load packages and set paths

library("dada2")
library("data.table")
library("DECIPHER")
library("phangorn")
library("phyloseq")
library("ggplot2")

Change the fastq path to correspond to the location of the raw fastq files.

fastq = "raw"           # raw fastq files
trimmed = "trimmed"     # cutadapt trimmed fastq files
filt = "filt"           # dada2 trimmed fastq files
outfiles = "outfiles"   # output files
images = "images"       # output images

if(!dir.exists(filt)) dir.create(filt)
if(!dir.exists(outfiles)) dir.create(outfiles)
if(!dir.exists(images)) dir.create(images)
head(list.files(fastq),15)
##  [1] "download.sh"           "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz"
##  [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730154_1.fastq.gz"
##  [7] "ERR2730154_2.fastq.gz" "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz"
## [10] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730162_1.fastq.gz"
## [13] "ERR2730162_2.fastq.gz" "ERR2730163_1.fastq.gz" "ERR2730163_2.fastq.gz"

*Primer trimming in R

The trimming program cutadapt is called via the system2 function to perform the trimming.

Note: Skip this step if primer trimming has already been performed

fns = sort(list.files(fastq, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)]
fnRs = fns[grep("2.fastq.gz", fns)]

if(!dir.exists(trimmed)) dir.create(trimmed)

fnFs.cut = file.path(trimmed, basename(fnFs))
fnRs.cut = file.path(trimmed, basename(fnRs))
log.cut = gsub(".1.fastq.gz", ".log", fnFs.cut)
sample.names = gsub(".1.fastq.gz", "", basename(fnFs.cut))

# Define the primer set used to perform PCR
FWD = "CCTACGGGNGGCWGCAG"       # 341F
REV = "GACTACHVGGGTATCTAATCC"   # 785R

# Get reverse complement DNA sequences
FWD.RC = dada2::rc(FWD)
REV.RC = dada2::rc(REV)

# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags = paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags = paste("-G", REV, "-A", FWD.RC)

# Run cutadapt to remove primers
# Note to change the PATH to cutadapt accordingly
cutadapt = "/path-to-bin/cutadapt"
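# Optional: verify the cutadapt binary is accessible and check its version
# (a quick sanity check; if this fails, adjust the path above)
system2(cutadapt, args = "--version")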

for(i in seq_along(fnFs)) {
    print(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
    
    system2(cutadapt,
        stdout = log.cut[i], stderr = log.cut[i], # log file
        args = c(R1.flags, R2.flags,
        "-n 2",                   # -n 2 required to remove FWD and REV from reads
        "--match-read-wildcards", # enable IUPAC nucleotide codes (wildcard characters)
        "--length 300",           # Truncate reads to 300 bp
        "-m 150",                 # discard reads shorter than LEN (avoid length zero sequences)
        "--overlap 10",           # min overlap between read and adapter for an adapter to be found
        "-j 0",                   # auto-detection of CPU cores, only available on Python 3
        "-o", fnFs.cut[i], "-p", fnRs.cut[i], # trimmed files
        fnFs[i], fnRs[i])         # input files
    )
}
head(list.files(trimmed),15)
##  [1] "ERR2730149_1.fastq.gz" "ERR2730149_2.fastq.gz" "ERR2730149.log"       
##  [4] "ERR2730151_1.fastq.gz" "ERR2730151_2.fastq.gz" "ERR2730151.log"       
##  [7] "ERR2730154_1.fastq.gz" "ERR2730154_2.fastq.gz" "ERR2730154.log"       
## [10] "ERR2730155_1.fastq.gz" "ERR2730155_2.fastq.gz" "ERR2730155.log"       
## [13] "ERR2730158_1.fastq.gz" "ERR2730158_2.fastq.gz" "ERR2730158.log"

Build trimmed file lists

fns = sort(list.files(trimmed, full.names = TRUE))
fnFs = fns[grep("1.fastq.gz", fns)]     # Update the grep pattern when necessary
fnRs = fns[grep("2.fastq.gz", fns)]     # Update the grep pattern when necessary
sample.names = gsub(".1.fastq.gz", "", basename(fnFs))  # Update the gsub pattern when necessary

# Check objects
fnFs
##   [1] "trimmed/ERR2730149_1.fastq.gz" "trimmed/ERR2730151_1.fastq.gz"
##   [3] "trimmed/ERR2730154_1.fastq.gz" "trimmed/ERR2730155_1.fastq.gz"
##   [5] "trimmed/ERR2730158_1.fastq.gz" "trimmed/ERR2730162_1.fastq.gz"
##   [7] "trimmed/ERR2730163_1.fastq.gz" "trimmed/ERR2730165_1.fastq.gz"
##   [9] "trimmed/ERR2730166_1.fastq.gz" "trimmed/ERR2730170_1.fastq.gz"
##  [11] "trimmed/ERR2730171_1.fastq.gz" "trimmed/ERR2730174_1.fastq.gz"
##  [13] "trimmed/ERR2730178_1.fastq.gz" "trimmed/ERR2730179_1.fastq.gz"
##  [15] "trimmed/ERR2730180_1.fastq.gz" "trimmed/ERR2730181_1.fastq.gz"
##  [17] "trimmed/ERR2730184_1.fastq.gz" "trimmed/ERR2730187_1.fastq.gz"
##  [19] "trimmed/ERR2730189_1.fastq.gz" "trimmed/ERR2730190_1.fastq.gz"
##  [21] "trimmed/ERR2730193_1.fastq.gz" "trimmed/ERR2730194_1.fastq.gz"
##  [23] "trimmed/ERR2730196_1.fastq.gz" "trimmed/ERR2730197_1.fastq.gz"
##  [25] "trimmed/ERR2730198_1.fastq.gz" "trimmed/ERR2730199_1.fastq.gz"
##  [27] "trimmed/ERR2730203_1.fastq.gz" "trimmed/ERR2730204_1.fastq.gz"
##  [29] "trimmed/ERR2730206_1.fastq.gz" "trimmed/ERR2730209_1.fastq.gz"
##  [31] "trimmed/ERR2730211_1.fastq.gz" "trimmed/ERR2730213_1.fastq.gz"
##  [33] "trimmed/ERR2730214_1.fastq.gz" "trimmed/ERR2730216_1.fastq.gz"
##  [35] "trimmed/ERR2730217_1.fastq.gz" "trimmed/ERR2730219_1.fastq.gz"
##  [37] "trimmed/ERR2730223_1.fastq.gz" "trimmed/ERR2730225_1.fastq.gz"
##  [39] "trimmed/ERR2730232_1.fastq.gz" "trimmed/ERR2730233_1.fastq.gz"
##  [41] "trimmed/ERR2730234_1.fastq.gz" "trimmed/ERR2730235_1.fastq.gz"
##  [43] "trimmed/ERR2730236_1.fastq.gz" "trimmed/ERR2730237_1.fastq.gz"
##  [45] "trimmed/ERR2730243_1.fastq.gz" "trimmed/ERR2730246_1.fastq.gz"
##  [47] "trimmed/ERR2730247_1.fastq.gz" "trimmed/ERR2730248_1.fastq.gz"
##  [49] "trimmed/ERR2730249_1.fastq.gz" "trimmed/ERR2730253_1.fastq.gz"
##  [51] "trimmed/ERR2730256_1.fastq.gz" "trimmed/ERR2730258_1.fastq.gz"
##  [53] "trimmed/ERR2730259_1.fastq.gz" "trimmed/ERR2730261_1.fastq.gz"
##  [55] "trimmed/ERR2730265_1.fastq.gz" "trimmed/ERR2730267_1.fastq.gz"
##  [57] "trimmed/ERR2730269_1.fastq.gz" "trimmed/ERR2730270_1.fastq.gz"
##  [59] "trimmed/ERR2730272_1.fastq.gz" "trimmed/ERR2730274_1.fastq.gz"
##  [61] "trimmed/ERR2730275_1.fastq.gz" "trimmed/ERR2730277_1.fastq.gz"
##  [63] "trimmed/ERR2730279_1.fastq.gz" "trimmed/ERR2730282_1.fastq.gz"
##  [65] "trimmed/ERR2730283_1.fastq.gz" "trimmed/ERR2730286_1.fastq.gz"
##  [67] "trimmed/ERR2730290_1.fastq.gz" "trimmed/ERR2730291_1.fastq.gz"
##  [69] "trimmed/ERR2730293_1.fastq.gz" "trimmed/ERR2730294_1.fastq.gz"
##  [71] "trimmed/ERR2730298_1.fastq.gz" "trimmed/ERR2730299_1.fastq.gz"
##  [73] "trimmed/ERR2730302_1.fastq.gz" "trimmed/ERR2730306_1.fastq.gz"
##  [75] "trimmed/ERR2730307_1.fastq.gz" "trimmed/ERR2730308_1.fastq.gz"
##  [77] "trimmed/ERR2730309_1.fastq.gz" "trimmed/ERR2730312_1.fastq.gz"
##  [79] "trimmed/ERR2730315_1.fastq.gz" "trimmed/ERR2730317_1.fastq.gz"
##  [81] "trimmed/ERR2730318_1.fastq.gz" "trimmed/ERR2730321_1.fastq.gz"
##  [83] "trimmed/ERR2730322_1.fastq.gz" "trimmed/ERR2730324_1.fastq.gz"
##  [85] "trimmed/ERR2730325_1.fastq.gz" "trimmed/ERR2730326_1.fastq.gz"
##  [87] "trimmed/ERR2730327_1.fastq.gz" "trimmed/ERR2730331_1.fastq.gz"
##  [89] "trimmed/ERR2730332_1.fastq.gz" "trimmed/ERR2730334_1.fastq.gz"
##  [91] "trimmed/ERR2730337_1.fastq.gz" "trimmed/ERR2730339_1.fastq.gz"
##  [93] "trimmed/ERR2730341_1.fastq.gz" "trimmed/ERR2730342_1.fastq.gz"
##  [95] "trimmed/ERR2730344_1.fastq.gz" "trimmed/ERR2730345_1.fastq.gz"
##  [97] "trimmed/ERR2730347_1.fastq.gz" "trimmed/ERR2730351_1.fastq.gz"
##  [99] "trimmed/ERR2730353_1.fastq.gz" "trimmed/ERR2730360_1.fastq.gz"
## [101] "trimmed/ERR2730361_1.fastq.gz" "trimmed/ERR2730362_1.fastq.gz"
## [103] "trimmed/ERR2730363_1.fastq.gz" "trimmed/ERR2730364_1.fastq.gz"
## [105] "trimmed/ERR2730365_1.fastq.gz" "trimmed/ERR2730371_1.fastq.gz"
## [107] "trimmed/ERR2730374_1.fastq.gz" "trimmed/ERR2730375_1.fastq.gz"
## [109] "trimmed/ERR2730376_1.fastq.gz" "trimmed/ERR2730377_1.fastq.gz"
## [111] "trimmed/ERR2730381_1.fastq.gz" "trimmed/ERR2730384_1.fastq.gz"
## [113] "trimmed/ERR2730386_1.fastq.gz" "trimmed/ERR2730387_1.fastq.gz"
## [115] "trimmed/ERR2730389_1.fastq.gz" "trimmed/ERR2730393_1.fastq.gz"
## [117] "trimmed/ERR2730395_1.fastq.gz" "trimmed/ERR2730397_1.fastq.gz"
## [119] "trimmed/ERR2730398_1.fastq.gz" "trimmed/ERR2730400_1.fastq.gz"
## [121] "trimmed/ERR2730402_1.fastq.gz" "trimmed/ERR2730403_1.fastq.gz"
## [123] "trimmed/ERR3040013_1.fastq.gz" "trimmed/ERR3040014_1.fastq.gz"
## [125] "trimmed/ERR3040017_1.fastq.gz" "trimmed/ERR3040018_1.fastq.gz"
fnRs
##   [1] "trimmed/ERR2730149_2.fastq.gz" "trimmed/ERR2730151_2.fastq.gz"
##   [3] "trimmed/ERR2730154_2.fastq.gz" "trimmed/ERR2730155_2.fastq.gz"
##   [5] "trimmed/ERR2730158_2.fastq.gz" "trimmed/ERR2730162_2.fastq.gz"
##   [7] "trimmed/ERR2730163_2.fastq.gz" "trimmed/ERR2730165_2.fastq.gz"
##   [9] "trimmed/ERR2730166_2.fastq.gz" "trimmed/ERR2730170_2.fastq.gz"
##  [11] "trimmed/ERR2730171_2.fastq.gz" "trimmed/ERR2730174_2.fastq.gz"
##  [13] "trimmed/ERR2730178_2.fastq.gz" "trimmed/ERR2730179_2.fastq.gz"
##  [15] "trimmed/ERR2730180_2.fastq.gz" "trimmed/ERR2730181_2.fastq.gz"
##  [17] "trimmed/ERR2730184_2.fastq.gz" "trimmed/ERR2730187_2.fastq.gz"
##  [19] "trimmed/ERR2730189_2.fastq.gz" "trimmed/ERR2730190_2.fastq.gz"
##  [21] "trimmed/ERR2730193_2.fastq.gz" "trimmed/ERR2730194_2.fastq.gz"
##  [23] "trimmed/ERR2730196_2.fastq.gz" "trimmed/ERR2730197_2.fastq.gz"
##  [25] "trimmed/ERR2730198_2.fastq.gz" "trimmed/ERR2730199_2.fastq.gz"
##  [27] "trimmed/ERR2730203_2.fastq.gz" "trimmed/ERR2730204_2.fastq.gz"
##  [29] "trimmed/ERR2730206_2.fastq.gz" "trimmed/ERR2730209_2.fastq.gz"
##  [31] "trimmed/ERR2730211_2.fastq.gz" "trimmed/ERR2730213_2.fastq.gz"
##  [33] "trimmed/ERR2730214_2.fastq.gz" "trimmed/ERR2730216_2.fastq.gz"
##  [35] "trimmed/ERR2730217_2.fastq.gz" "trimmed/ERR2730219_2.fastq.gz"
##  [37] "trimmed/ERR2730223_2.fastq.gz" "trimmed/ERR2730225_2.fastq.gz"
##  [39] "trimmed/ERR2730232_2.fastq.gz" "trimmed/ERR2730233_2.fastq.gz"
##  [41] "trimmed/ERR2730234_2.fastq.gz" "trimmed/ERR2730235_2.fastq.gz"
##  [43] "trimmed/ERR2730236_2.fastq.gz" "trimmed/ERR2730237_2.fastq.gz"
##  [45] "trimmed/ERR2730243_2.fastq.gz" "trimmed/ERR2730246_2.fastq.gz"
##  [47] "trimmed/ERR2730247_2.fastq.gz" "trimmed/ERR2730248_2.fastq.gz"
##  [49] "trimmed/ERR2730249_2.fastq.gz" "trimmed/ERR2730253_2.fastq.gz"
##  [51] "trimmed/ERR2730256_2.fastq.gz" "trimmed/ERR2730258_2.fastq.gz"
##  [53] "trimmed/ERR2730259_2.fastq.gz" "trimmed/ERR2730261_2.fastq.gz"
##  [55] "trimmed/ERR2730265_2.fastq.gz" "trimmed/ERR2730267_2.fastq.gz"
##  [57] "trimmed/ERR2730269_2.fastq.gz" "trimmed/ERR2730270_2.fastq.gz"
##  [59] "trimmed/ERR2730272_2.fastq.gz" "trimmed/ERR2730274_2.fastq.gz"
##  [61] "trimmed/ERR2730275_2.fastq.gz" "trimmed/ERR2730277_2.fastq.gz"
##  [63] "trimmed/ERR2730279_2.fastq.gz" "trimmed/ERR2730282_2.fastq.gz"
##  [65] "trimmed/ERR2730283_2.fastq.gz" "trimmed/ERR2730286_2.fastq.gz"
##  [67] "trimmed/ERR2730290_2.fastq.gz" "trimmed/ERR2730291_2.fastq.gz"
##  [69] "trimmed/ERR2730293_2.fastq.gz" "trimmed/ERR2730294_2.fastq.gz"
##  [71] "trimmed/ERR2730298_2.fastq.gz" "trimmed/ERR2730299_2.fastq.gz"
##  [73] "trimmed/ERR2730302_2.fastq.gz" "trimmed/ERR2730306_2.fastq.gz"
##  [75] "trimmed/ERR2730307_2.fastq.gz" "trimmed/ERR2730308_2.fastq.gz"
##  [77] "trimmed/ERR2730309_2.fastq.gz" "trimmed/ERR2730312_2.fastq.gz"
##  [79] "trimmed/ERR2730315_2.fastq.gz" "trimmed/ERR2730317_2.fastq.gz"
##  [81] "trimmed/ERR2730318_2.fastq.gz" "trimmed/ERR2730321_2.fastq.gz"
##  [83] "trimmed/ERR2730322_2.fastq.gz" "trimmed/ERR2730324_2.fastq.gz"
##  [85] "trimmed/ERR2730325_2.fastq.gz" "trimmed/ERR2730326_2.fastq.gz"
##  [87] "trimmed/ERR2730327_2.fastq.gz" "trimmed/ERR2730331_2.fastq.gz"
##  [89] "trimmed/ERR2730332_2.fastq.gz" "trimmed/ERR2730334_2.fastq.gz"
##  [91] "trimmed/ERR2730337_2.fastq.gz" "trimmed/ERR2730339_2.fastq.gz"
##  [93] "trimmed/ERR2730341_2.fastq.gz" "trimmed/ERR2730342_2.fastq.gz"
##  [95] "trimmed/ERR2730344_2.fastq.gz" "trimmed/ERR2730345_2.fastq.gz"
##  [97] "trimmed/ERR2730347_2.fastq.gz" "trimmed/ERR2730351_2.fastq.gz"
##  [99] "trimmed/ERR2730353_2.fastq.gz" "trimmed/ERR2730360_2.fastq.gz"
## [101] "trimmed/ERR2730361_2.fastq.gz" "trimmed/ERR2730362_2.fastq.gz"
## [103] "trimmed/ERR2730363_2.fastq.gz" "trimmed/ERR2730364_2.fastq.gz"
## [105] "trimmed/ERR2730365_2.fastq.gz" "trimmed/ERR2730371_2.fastq.gz"
## [107] "trimmed/ERR2730374_2.fastq.gz" "trimmed/ERR2730375_2.fastq.gz"
## [109] "trimmed/ERR2730376_2.fastq.gz" "trimmed/ERR2730377_2.fastq.gz"
## [111] "trimmed/ERR2730381_2.fastq.gz" "trimmed/ERR2730384_2.fastq.gz"
## [113] "trimmed/ERR2730386_2.fastq.gz" "trimmed/ERR2730387_2.fastq.gz"
## [115] "trimmed/ERR2730389_2.fastq.gz" "trimmed/ERR2730393_2.fastq.gz"
## [117] "trimmed/ERR2730395_2.fastq.gz" "trimmed/ERR2730397_2.fastq.gz"
## [119] "trimmed/ERR2730398_2.fastq.gz" "trimmed/ERR2730400_2.fastq.gz"
## [121] "trimmed/ERR2730402_2.fastq.gz" "trimmed/ERR2730403_2.fastq.gz"
## [123] "trimmed/ERR3040013_2.fastq.gz" "trimmed/ERR3040014_2.fastq.gz"
## [125] "trimmed/ERR3040017_2.fastq.gz" "trimmed/ERR3040018_2.fastq.gz"
sample.names
##   [1] "ERR2730149" "ERR2730151" "ERR2730154" "ERR2730155" "ERR2730158" "ERR2730162"
##   [7] "ERR2730163" "ERR2730165" "ERR2730166" "ERR2730170" "ERR2730171" "ERR2730174"
##  [13] "ERR2730178" "ERR2730179" "ERR2730180" "ERR2730181" "ERR2730184" "ERR2730187"
##  [19] "ERR2730189" "ERR2730190" "ERR2730193" "ERR2730194" "ERR2730196" "ERR2730197"
##  [25] "ERR2730198" "ERR2730199" "ERR2730203" "ERR2730204" "ERR2730206" "ERR2730209"
##  [31] "ERR2730211" "ERR2730213" "ERR2730214" "ERR2730216" "ERR2730217" "ERR2730219"
##  [37] "ERR2730223" "ERR2730225" "ERR2730232" "ERR2730233" "ERR2730234" "ERR2730235"
##  [43] "ERR2730236" "ERR2730237" "ERR2730243" "ERR2730246" "ERR2730247" "ERR2730248"
##  [49] "ERR2730249" "ERR2730253" "ERR2730256" "ERR2730258" "ERR2730259" "ERR2730261"
##  [55] "ERR2730265" "ERR2730267" "ERR2730269" "ERR2730270" "ERR2730272" "ERR2730274"
##  [61] "ERR2730275" "ERR2730277" "ERR2730279" "ERR2730282" "ERR2730283" "ERR2730286"
##  [67] "ERR2730290" "ERR2730291" "ERR2730293" "ERR2730294" "ERR2730298" "ERR2730299"
##  [73] "ERR2730302" "ERR2730306" "ERR2730307" "ERR2730308" "ERR2730309" "ERR2730312"
##  [79] "ERR2730315" "ERR2730317" "ERR2730318" "ERR2730321" "ERR2730322" "ERR2730324"
##  [85] "ERR2730325" "ERR2730326" "ERR2730327" "ERR2730331" "ERR2730332" "ERR2730334"
##  [91] "ERR2730337" "ERR2730339" "ERR2730341" "ERR2730342" "ERR2730344" "ERR2730345"
##  [97] "ERR2730347" "ERR2730351" "ERR2730353" "ERR2730360" "ERR2730361" "ERR2730362"
## [103] "ERR2730363" "ERR2730364" "ERR2730365" "ERR2730371" "ERR2730374" "ERR2730375"
## [109] "ERR2730376" "ERR2730377" "ERR2730381" "ERR2730384" "ERR2730386" "ERR2730387"
## [115] "ERR2730389" "ERR2730393" "ERR2730395" "ERR2730397" "ERR2730398" "ERR2730400"
## [121] "ERR2730402" "ERR2730403" "ERR3040013" "ERR3040014" "ERR3040017" "ERR3040018"

Inspect read quality profiles

We use the plotQualityProfile function to plot the quality profiles of the trimmed fastq files.

# Plot quality profile of fastq files
ii = 1:length(sample.names)
pdf(paste0(images, "/plotQualityProfile.pdf"), width = 8, height = 8, pointsize = 12)
for(i in ii) {
    message(paste0("[", i ,"/", length(sample.names), "] ", sample.names[i]))
    print(plotQualityProfile(fnFs[i]) + ggtitle("Fwd"))
    print(plotQualityProfile(fnRs[i]) + ggtitle("Rev"))
}
invisible(dev.off())

Reviewing the quality profiles, the quality of the forward reads appears to drop after position 260, and that of the reverse reads after position 200. We will use these positions to truncate the forward and reverse reads, respectively, in the next step.
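
The quality information can also be aggregated across all samples into a single summary plot, which can make it easier to settle on truncation positions; a minimal sketch using the aggregate argument of plotQualityProfile:

plotQualityProfile(fnFs, aggregate = TRUE) + ggtitle("Fwd (aggregated)")
plotQualityProfile(fnRs, aggregate = TRUE) + ggtitle("Rev (aggregated)")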

Filter and trim

Remember to review plotQualityProfile.pdf to select the best parameters for the truncLen argument.

# Set paths to the dada2-filtered files
filtFs = file.path(filt, basename(fnFs))
filtRs = file.path(filt, basename(fnRs))

# Perform filtering and trimming
out = filterAndTrim(fnFs, filtFs, fnRs, filtRs,
        # Keep parameters consistent between runs of the same study
        truncLen = c(260,200), minLen = 200, maxN = 0, truncQ = 2, maxEE = c(2,5),
        rm.phix = TRUE, compress = TRUE, verbose = TRUE, multithread = TRUE)

out = as.data.frame(out)
rownames(out) = sample.names
head(out, 10)
##            reads.in reads.out
## ERR2730149    65747     58829
## ERR2730151    70743     61547
## ERR2730154    91558     81374
## ERR2730155    58955     53269
## ERR2730158   128246    116847
## ERR2730162   134565    121999
## ERR2730163   116621    105093
## ERR2730165    57401     51513
## ERR2730166   121037    108838
## ERR2730170    57359     51276

Learn the Error Rates

Use the learnErrors function to perform dereplication and learn the error rates. The derepFastq function used in past workflows has been integrated into the learnErrors function.

errF = learnErrors(filtFs, multithread = TRUE)
## 128404900 total bases in 493865 reads from 6 samples will be used for learning the error rates.
errR = learnErrors(filtRs, multithread = TRUE)
## 119791600 total bases in 598958 reads from 7 samples will be used for learning the error rates.
# Visualize the estimated error rates
pdf(paste0(images, "/plotErrors.pdf"), width = 10, height = 10, pointsize = 12)
plotErrors(errF, nominalQ = TRUE)
plotErrors(errR, nominalQ = TRUE)
invisible(dev.off())

The resulting plotErrors.pdf contains the estimated error rate plots for the forward and reverse reads.

Sample Inference

We now apply the core sample inference algorithm to the filtered and trimmed sequence data.

Note: By default, the dada function processes each sample independently (i.e. pool = FALSE). If your samples come from an extremely diverse community (e.g. soil), pooling (pool = TRUE) or pseudo-pooling (recommended; pool = "pseudo") may help identify rare ASVs in each sample; see the sketch after the code below.

dadaFs = dada(filtFs, err = errF, pool = FALSE, multithread = TRUE)
dadaRs = dada(filtRs, err = errR, pool = FALSE, multithread = TRUE)
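
If pseudo-pooling is preferred (see the note above), the corresponding calls are sketched below, commented out; use one approach consistently within a study.

# Pseudo-pooling alternative (sketch)
# dadaFs = dada(filtFs, err = errF, pool = "pseudo", multithread = TRUE)
# dadaRs = dada(filtRs, err = errR, pool = "pseudo", multithread = TRUE)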

Merge paired reads

We now merge the forward and reverse reads together to obtain the full denoised sequences.

mergers = mergePairs(dadaFs, filtFs, dadaRs, filtRs, verbose = TRUE)

Construct sequence table

We now construct an amplicon sequence variant (ASV) table.

seqtab = makeSequenceTable(mergers)

View the length frequency distribution of the merged sequences using the table function.

table(nchar(getSequences(seqtab)))
## 
##   260   261   264   267   270   272   273   275   277   278   279   281   282   284   286 
##   461     7     1     1     7     3     7     2    10     7    11    14     1     1     1 
##   288   300   301   313   317   320   325   335   341   365   384   385   386   400   401 
##     1     2     1     1     2     1     2     1     1     1     4     2     3     2   724 
##   402   403   404   405   406   407   408   409   410   411   412   413   417   418   419 
## 11302  2584  4587  4221   126   257   112   124    34   264    13     1     2     8   101 
##   420   421   422   423   424   425   426   427   428   429   430   431   432   433   438 
##    67   531  3606   326    82    68   206  1082   452     7    12     1     8     3     1 
##   439   440   441   442   443   444   445   446   447   448 
##    24     7    13    10     9    10     4     6     5     7
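
Optionally, merged sequences far outside the expected amplicon length range can be removed at this point. The window below is only an illustrative sketch (shown commented out; this tutorial keeps the full table) and should be chosen after inspecting the distribution above.

# Optional: keep only sequences within an expected length window (illustrative values)
# seqtab = seqtab[, nchar(getSequences(seqtab)) %in% seq(400, 430)]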

Save sequence table.

saveRDS(seqtab, "seqtab.rds")

# Or as an example, save only the first 5 samples
# saveRDS(seqtab[c(1:5),], "seqtab.rds")

*Merge multiple runs

This section briefly describes how to merge results from multiple sequencing runs. You can merge sequence tables from runs belonging to the same experiment or project before moving on to chimera removal and taxonomy assignment. In the example snippet below, we use mergeSequenceTables to combine two sequence tables.

Note: Skip this step if you do not need to merge multiple sequence tables

# Load sequence tables from multiple runs
seqtab1 = readRDS("/some-path/run1/seqtab.rds")
seqtab2 = readRDS("/some-path/run2/seqtab.rds")

# Check the dimensions and sample names
dim(seqtab1)
rownames(seqtab1)
dim(seqtab2)
rownames(seqtab2)

# Merge the sequence tables
seqtab = mergeSequenceTables(seqtab1, seqtab2)

# To sum counts of the same sample across multiple sequence tables (e.g. when a sample was re-sequenced due to low depth)
# seqtab = mergeSequenceTables(seqtab1, seqtab2, repeats = "sum")

# Check the dimension of the merged sequence table
dim(seqtab)

After obtaining the merged sequence table, you can continue with the data processing steps below.

Remove chimeras

seqtab.nochim = removeBimeraDenovo(seqtab, method = "consensus", multithread = TRUE, 
                   verbose = TRUE)
## Identified 23851 bimeras out of 31565 input sequences.
# Update the row names from filtered file names to sample names
rownames(seqtab.nochim) = sample.names

View the dimensions of seqtab and seqtab.nochim.

dim(seqtab)
## [1]   126 31565
dim(seqtab.nochim)
## [1]  126 7714

Print the proportion of merged sequence reads that are non-chimeric.

sum(seqtab.nochim)/sum(seqtab)
## [1] 0.9683915

Track reads through the pipeline

getN <- function(x) sum(getUniques(x))

track = cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), 
          sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) = c("Trimmed", "Filtered", "denoisedF", "denoisedR", "merged", "nonchim")
track = cbind(data.frame(SampleID = sample.names), track)

write.table(track, file = paste0(outfiles, "/track.txt"), sep = "\t", quote = F, 
        row.names = F, col.names = T)
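
A quick way to spot problematic samples is to compute the fraction of trimmed reads that survive to the final non-chimeric table; a small sketch that adds an illustrative retained column to track:

# Percentage of trimmed reads surviving to the non-chimeric table
track$retained = round(track$nonchim / track$Trimmed * 100, 1)
head(track[order(track$retained), ])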

Assign taxonomy

Set the full paths to the SILVA and NCBI reference databases.

dbpath = "/path-to-db/"
ref1 = paste0(dbpath, "silva_nr99_v138.1_train_set.fa.gz")
ref2 = paste0(dbpath, "silva_species_assignment_v138.1.fa.gz")
ref3 = paste0(dbpath, "16SMicrobial.fa.gz")

Use the assignTaxonomy function to classify sequences against the SILVA reference training dataset (ref1), and use the assignSpecies function to perform species-level assignment by exact matching against the SILVA (ref2) and NCBI (ref3) reference datasets.

# Extract the sequences
seqs = getSequences(seqtab.nochim)

set.seed(12345)
taxtab = assignTaxonomy(seqs, refFasta = ref1, minBoot = 80, tryRC = TRUE, verbose = TRUE, 
            outputBootstraps = TRUE, multithread = TRUE)
## Finished processing reference fasta.
spec_silva = assignSpecies(seqs, ref2, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 473 out of 7714 were assigned to the species level.
spec_ncbi = assignSpecies(seqs, ref3, allowMultiple = FALSE, tryRC = TRUE, verbose = TRUE)
## 355 out of 7714 were assigned to the species level.

Combine the species-level taxonomic assignments from the two reference sources.

SVformat = paste("%0",nchar(as.character(ncol(seqtab.nochim))),"d", sep = "")
svid = paste0("SV", sprintf(SVformat, seq(ncol(seqtab.nochim))))

s_silva = as.data.frame(spec_silva, stringsAsFactors = FALSE)
rownames(s_silva) = svid

s_ncbi = as.data.frame(spec_ncbi, stringsAsFactors = FALSE)
rownames(s_ncbi) = svid
s_ncbi$Genus = gsub("\\[|\\]", "", s_ncbi$Genus)

s_merged = cbind(s_ncbi, s_silva)
colnames(s_merged) = c("nGenus","nSpecies","sGenus","sSpecies")
s_merged1 = s_merged[!is.na(s_merged$nSpecies),]
colnames(s_merged1)[1:2] = c("Genus","Species")
s_merged2 = s_merged[is.na(s_merged$nSpecies) & !is.na(s_merged$sSpecies),]
colnames(s_merged2)[3:4] = c("Genus","Species")
s_merged3 = s_merged[is.na(s_merged$nSpecies) & is.na(s_merged$sSpecies),]
colnames(s_merged3)[3:4] = c("Genus","Species")

s_final = rbind(s_merged1[,c("Genus","Species")], s_merged2[,c("Genus","Species")], 
        s_merged3[,c("Genus","Species")])
s_final = s_final[order(row.names(s_final)),]

s_final = as.matrix(s_final)

if("Genus" %in% colnames(taxtab$tax)) {
    gcol = which(colnames(taxtab$tax) == "Genus")
} else {
    gcol = ncol(taxtab$tax)
}

matchGenera <- function(gen.tax, gen.binom, split.glyph = "/") {
    if(is.na(gen.tax) || is.na(gen.binom)) { return(FALSE) }
    if((gen.tax == gen.binom) || 
       grepl(paste0("^", gen.binom, "[ _", split.glyph, "]"), gen.tax) || 
       grepl(paste0(split.glyph, gen.binom, "$"), gen.tax)) {
        return(TRUE)
    } else {
        return(FALSE)
    }
}

gen.match = mapply(matchGenera, taxtab$tax[,gcol], s_final[,1])
taxtab$tax = cbind(taxtab$tax, s_final[,2])
colnames(taxtab$tax)[ncol(taxtab$tax)] = "Species"

print(paste(sum(!is.na(s_final[,2])), "out of", 
        nrow(s_final), "were assigned to the species level."))
## [1] "587 out of 7714 were assigned to the species level."
taxtab$tax[!gen.match,"Species"] = NA
print(paste("Of which", sum(!is.na(taxtab$tax[,"Species"])), 
        "had genera consistent with the input table."))
## [1] "Of which 476 had genera consistent with the input table."

Multiple sequence alignment

Prepare a data.frame df from seqtab.nochim and taxtab.

df = data.frame(sequence = seqs, abundance = colSums(seqtab.nochim), stringsAsFactors = FALSE)
df$id = svid
df = merge(df, as.data.frame(taxtab), by = "row.names")
rownames(df) = df$id
df = df[order(df$id),2:ncol(df)]

Perform multiple sequence alignment of the ASV sequences with the AlignSeqs function from DECIPHER.

alignment = AlignSeqs(Biostrings::DNAStringSet(setNames(df$sequence, df$id)), anchor = NA)
## Determining distance matrix based on shared 8-mers:
## ==========================================================================================
## 
## Time difference of 479.97 secs
## 
## Clustering into groups by similarity:
## ==========================================================================================
## 
## Time difference of 7.04 secs
## 
## Aligning Sequences:
## ==========================================================================================
## 
## Time difference of 85.3 secs
## 
## Iteration 1 of 2:
## 
## Determining distance matrix based on alignment:
## ==========================================================================================
## 
## Time difference of 76.33 secs
## 
## Reclustering into groups by similarity:
## ==========================================================================================
## 
## Time difference of 5.85 secs
## 
## Realigning Sequences:
## ==========================================================================================
## 
## Time difference of 61.57 secs
## 
## Iteration 2 of 2:
## 
## Determining distance matrix based on alignment:
## ==========================================================================================
## 
## Time difference of 66.86 secs
## 
## Reclustering into groups by similarity:
## ==========================================================================================
## 
## Time difference of 5.87 secs
## 
## Realigning Sequences:
## ==========================================================================================
## 
## Time difference of 14.13 secs
## 
## Refining the alignment:
## ==========================================================================================
## 
## Time difference of 0.92 secs

Export alignment.

phang.align = phyDat(as(alignment, "matrix"), type = "DNA")
write.phyDat(phang.align, file = "alignment.fasta", format = "fasta")
write.phyDat(phang.align, file = "alignment.aln", format = "phylip")

Construct phylogenetic tree

Set the full paths to the RAxML and RAxML-NG binaries.

raxml = "/path-to-raxml-binary"     # e.g. /usr/local/bin/raxmlHPC-PTHREADS-SSE3
raxmlng = "/path-to-raxml-ng-binary"    # e.g. /usr/local/bin/raxml-ng

Run RAxML and RAxML-NG.

system2(raxml, args = c("-T 2", "-f E", "-p 1234", "-x 5678", "-m GTRCAT", "-N 1", 
            "-s alignment.aln", "-n raxml_tree_GTRCAT"))

system2(raxmlng, args = c("--evaluate", "--force", "--seed 1234", "--log progress", "--threads 2", 
              "--msa alignment.fasta",  "--model GTR+G", "--brlen scaled", 
              "--tree RAxML_fastTree.raxml_tree_GTRCAT", "--prefix GTRCAT"))

Import the tree using the read_tree function from phyloseq.

raxml_tree = read_tree("GTRCAT.raxml.bestTree")
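
As a minimal sanity check, confirm that the imported tree tips correspond to the SV identifiers used in the alignment:

# All tree tip labels should be SV ids present in df
all(raxml_tree$tip.label %in% df$id)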

Load sample information

samdf = data.frame(fread("sample.meta", colClasses = "character"))

rownames(samdf) = samdf$Sample_ID
samdf$Sample_ID = as.factor(samdf$Sample_ID)
samdf$Sample_ID = factor(samdf$Sample_ID, levels = c(sort(levels(samdf$Sample_ID), decreasing = F)))
samdf$Subject = as.factor(samdf$Subject)
samdf$Group2 = paste(samdf$Group, samdf$Time, sep = "_")
samdf$Group = as.factor(samdf$Group)
samdf$Group2 = as.factor(samdf$Group2)
samdf$Time = as.factor(samdf$Time)

head(samdf)
##            Run_ID Sample_ID   Sample_title Subject   Group     Time           Group2
## C0005B ERR2730149    C0005B C0005 baseline   C0005 control baseline control_baseline
## C0009B ERR2730151    C0009B C0009 baseline   C0009 control baseline control_baseline
## C0019B ERR2730154    C0019B C0019 baseline   C0019 control baseline control_baseline
## C0020B ERR2730155    C0020B C0020 baseline   C0020 control baseline control_baseline
## C0024B ERR2730158    C0024B C0024 baseline   C0024 control baseline control_baseline
## C0032B ERR2730162    C0032B C0032 baseline   C0032 control baseline control_baseline

Handoff to phyloseq

Prepare the new_seqtab and tax objects containing the abundance and taxonomy information, respectively.

new_seqtab = seqtab.nochim
colnames(new_seqtab) = df[match(colnames(new_seqtab), df$sequence),]$id

# Update rownames in new_seqtab from Run_ID to Sample_ID
rownames(new_seqtab) = as.character(samdf[match(rownames(new_seqtab), samdf$Run_ID),]$Sample_ID)

new_taxtab = taxtab
rownames(new_taxtab$tax) = df[match(rownames(new_taxtab$tax), df$sequence),]$id

tax = as.data.frame(new_taxtab$tax)
tax$Family = as.character(tax$Family)
tax$Genus = as.character(tax$Genus)

Construct a phyloseq object.

ps = phyloseq(tax_table(as.matrix(tax)), 
          sample_data(samdf), 
          otu_table(new_seqtab, taxa_are_rows = FALSE), 
          phy_tree(raxml_tree))

ps
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 7714 taxa and 126 samples ]
## sample_data() Sample Data:       [ 126 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 7714 taxa by 7 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 7714 tips and 7712 internal nodes ]
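
Optionally, the ASV sequences themselves can be stored in the phyloseq object so that they travel with the abundance, taxonomy and tree components; a small sketch using the df data.frame built earlier:

# Attach the ASV sequences as reference sequences (names must match taxa_names(ps))
asv_seqs = Biostrings::DNAStringSet(setNames(df$sequence, df$id))
ps = merge_phyloseq(ps, asv_seqs)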

Save current workspace

save.image(file = "1_dada2_tutorial.RData")

Session information

sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
## 
## Matrix products: default
## BLAS/LAPACK: /home/ihsuan/miniconda3_4.10.0/envs/r4-16S/lib/libopenblasp-r0.3.15.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C               LC_TIME=en_GB.UTF-8       
##  [4] LC_COLLATE=en_GB.UTF-8     LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
## [10] LC_TELEPHONE=C             LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods  
## [9] base     
## 
## other attached packages:
##  [1] ggplot2_3.3.4       phyloseq_1.36.0     phangorn_2.7.0      ape_5.5            
##  [5] DECIPHER_2.20.0     RSQLite_2.2.5       Biostrings_2.60.0   GenomeInfoDb_1.28.0
##  [9] XVector_0.32.0      IRanges_2.26.0      S4Vectors_0.30.0    BiocGenerics_0.38.0
## [13] data.table_1.14.0   dada2_1.20.0        Rcpp_1.0.6          knitr_1.33         
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-152                bitops_1.0-7                matrixStats_0.59.0         
##  [4] bit64_4.0.5                 RColorBrewer_1.1-2          tools_4.1.0                
##  [7] vegan_2.5-7                 utf8_1.2.1                  R6_2.5.0                   
## [10] mgcv_1.8-36                 DBI_1.1.1                   colorspace_2.0-2           
## [13] permute_0.9-5               rhdf5filters_1.4.0          ade4_1.7-17                
## [16] withr_2.4.2                 gridExtra_2.3               tidyselect_1.1.1           
## [19] bit_4.0.4                   compiler_4.1.0              Biobase_2.52.0             
## [22] DelayedArray_0.18.0         labeling_0.4.2              scales_1.1.1               
## [25] quadprog_1.5-8              stringr_1.4.0               digest_0.6.27              
## [28] Rsamtools_2.8.0             rmarkdown_2.9               jpeg_0.1-8.1               
## [31] pkgconfig_2.0.3             htmltools_0.5.1.1           MatrixGenerics_1.4.0       
## [34] highr_0.9                   fastmap_1.1.0               rlang_0.4.11               
## [37] farver_2.1.0                generics_0.1.0              hwriter_1.3.2              
## [40] jsonlite_1.7.2              BiocParallel_1.26.0         dplyr_1.0.7                
## [43] RCurl_1.98-1.3              magrittr_2.0.1              GenomeInfoDbData_1.2.6     
## [46] biomformat_1.20.0           Matrix_1.3-4                Rhdf5lib_1.14.0            
## [49] munsell_0.5.0               fansi_0.4.2                 lifecycle_1.0.0            
## [52] stringi_1.6.2               yaml_2.2.1                  MASS_7.3-54                
## [55] SummarizedExperiment_1.22.0 zlibbioc_1.38.0             rhdf5_2.36.0               
## [58] plyr_1.8.6                  grid_4.1.0                  blob_1.2.1                 
## [61] crayon_1.4.1                lattice_0.20-44             splines_4.1.0              
## [64] multtest_2.48.0             pillar_1.6.1                igraph_1.2.6               
## [67] GenomicRanges_1.44.0        reshape2_1.4.4              codetools_0.2-18           
## [70] fastmatch_1.1-0             glue_1.4.2                  evaluate_0.14              
## [73] ShortRead_1.50.0            latticeExtra_0.6-29         RcppParallel_5.1.4         
## [76] png_0.1-7                   vctrs_0.3.8                 foreach_1.5.1              
## [79] gtable_0.3.0                purrr_0.3.4                 assertthat_0.2.1           
## [82] cachem_1.0.5                xfun_0.24                   survival_3.2-11            
## [85] tibble_3.1.2                iterators_1.0.13            GenomicAlignments_1.28.0   
## [88] memoise_2.0.0               cluster_2.1.2               ellipsis_0.3.2