This analysis module wraps Trimmomatic, a read-trimming tool for Illumina high-throughput sequencing data that supports both paired-end and single-end reads.
Trimmomatic provides the following trimming steps:
- ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read
- SLIDINGWINDOW: Perform sliding window trimming, cutting once the average quality within the window falls below a threshold
- MINLEN: Drop the read if it is below a specified length
- LEADING: Cut bases off the start of a read if they are below a threshold quality
- TRAILING: Cut bases off the end of a read if they are below a threshold quality
- CROP: Cut the read to a specified length
- HEADCROP: Cut the specified number of bases from the start of the read
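The quality-based steps in the list above can be illustrated with a minimal Python sketch. It is a simplified model, not Trimmomatic's actual implementation: the function names are hypothetical, qualities are assumed to be already decoded into integer Phred scores, and a read shorter than the window is assumed to be averaged as a single window.

```python
def sliding_window_cut(quals, window_size, min_avg):
    """Return how many bases to keep from the 5' end (SLIDINGWINDOW-like).

    Windows are scanned from the start of the read; the read is cut at the
    start of the first window whose mean quality is below min_avg.
    """
    n = len(quals)
    if n == 0:
        return 0
    w = min(window_size, n)  # assumption: short reads form one window
    for start in range(0, n - w + 1):
        window = quals[start:start + w]
        if sum(window) / w < min_avg:
            return start
    return n


def trailing_cut(quals, min_quality):
    """Return how many bases remain after removing low-quality 3' bases (TRAILING-like)."""
    end = len(quals)
    while end > 0 and quals[end - 1] < min_quality:
        end -= 1
    return end


def passes_minlen(length, min_len):
    """MINLEN-like filter: keep the read only if it is at least min_len bases long."""
    return length >= min_len
```

For example, with qualities `[30, 30, 30, 30, 10, 10, 10, 10]`, a window of 4 and a required average of 20, the first failing window starts at index 3, so only the first three bases are kept.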
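Input:
For single-end data, provide a single FASTQ file.
For paired-end data, provide two FASTQ files (R1 and R2).
Set the quality score encoding: Illumina 1.3-1.7 Phred+64 corresponds to early Illumina platforms, and Illumina 1.8+ Phred+33 corresponds to current Illumina platforms. The default is Illumina 1.8+ Phred+33.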
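The difference between the two encodings is only the ASCII offset subtracted from each quality character. The following Python sketch shows the decoding; the function name `decode_qualities` is hypothetical, and treating a negative score as an error is an assumption made here to flag a likely wrong offset.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of integer Phred scores.

    offset=33 corresponds to Illumina 1.8+ (Phred+33),
    offset=64 corresponds to Illumina 1.3-1.7 (Phred+64).
    """
    if offset not in (33, 64):
        raise ValueError("offset must be 33 or 64")
    scores = [ord(ch) - offset for ch in quality_string]
    if any(q < 0 for q in scores):
        # Assumption: negative scores most likely mean the wrong offset was chosen.
        raise ValueError("negative quality score; check the encoding offset")
    return scores
```

Under Phred+33 the character `I` decodes to 40 and `5` to 20; under Phred+64 the same scores are written as `h` and `T`.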
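Output:
For single-end data, the output is a single FASTQ file containing the trimmed and filtered clean data.
For paired-end data, the output is four files:
Two FASTQ files (R1-paired and R2-paired) containing the read pairs in which both mates (R1 and R2) passed quality control.
Two additional FASTQ files (R1-unpaired and R2-unpaired) containing reads for which only one mate (R1 or R2) passed quality control while the other did not, so only the surviving mate is kept.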
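The routing of a read pair into the four paired-end outputs can be summarized in a few lines of Python. This is an illustrative sketch only; the function name `route_pair` is hypothetical, and the returned labels simply mirror the output names used above.

```python
def route_pair(r1_survives, r2_survives):
    """Return the output files that receive the mates of one read pair."""
    if r1_survives and r2_survives:
        return ("R1-paired", "R2-paired")
    if r1_survives:
        return ("R1-unpaired",)
    if r2_survives:
        return ("R2-unpaired",)
    return ()  # both mates failed quality control; nothing is written
```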
Appendix:
For routine RNA or DNA sequencing on the HiSeq 4000 or HiSeq X Ten platforms with PE100 or PE150 reads, the following settings are recommended:
Perform initial ILLUMINACLIP step: Yes
Maximum mismatch count which will still allow a full match to be performed: 2
How accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment: 30
How accurate the match between any adapter etc. sequence must be against a read: 10
Perform Sliding window trimming (SLIDINGWINDOW): Yes
Number of bases to average across: 20
Average quality required: 20
Drop reads below a specified length (MINLEN): Yes
Minimum length of reads to be kept: 35
Cut bases off the end of a read, if below a threshold quality (TRAILING): Yes
Minimum quality required to keep a base: 20
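In other words: remove adapter contamination, allowing at most 2 mismatches in the initial (seed) match, with a palindrome clip threshold of 30 and a simple clip threshold of 10 (both are alignment score thresholds rather than counts of matching bases). Trim bases with quality below 20 from the end of each read; scan with a 20 bp window and, once the average quality within the window falls below 20, cut the read from the start of that window onward; finally, discard reads shorter than 35 bp after trimming.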
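To give a feel for the two clip thresholds, the sketch below follows the scoring described in the Trimmomatic manual, where each matching base adds about 0.6 (log10 of 4) to the alignment score and each mismatch subtracts Q/10. The function names are hypothetical and the code is a back-of-the-envelope model, not the tool's alignment routine.

```python
import math

MATCH_SCORE = math.log10(4)  # about 0.602 per matching base


def alignment_score(matches, mismatch_quals=()):
    """Approximate score: +log10(4) per match, -Q/10 per mismatched base of quality Q."""
    return matches * MATCH_SCORE - sum(q / 10 for q in mismatch_quals)


def min_perfect_match_length(threshold):
    """Smallest number of perfectly matching bases whose score reaches the threshold."""
    return math.ceil(threshold / MATCH_SCORE)
```

Under this model, a simple clip threshold of 10 needs about 17 perfectly matching bases, and a palindrome clip threshold of 30 needs about 50.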
For amplicon sequencing on MiSeq with PE250 reads, the following settings are recommended:
Perform Sliding window trimming (SLIDINGWINDOW): Yes
Number of bases to average across: 50
Average quality required: 20
Drop reads below a specified length (MINLEN): Yes
Minimum length of reads to be kept: 50
Cut bases off the end of a read, if below a threshold quality (TRAILING): Yes
Minimum quality required to keep a base: 20
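In other words: trim bases with quality below 20 from the end of each read; scan with a 50 bp window and, once the average quality within the window falls below 20, cut the read from the start of that window onward; finally, discard reads shorter than 50 bp after trimming.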
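For readers who run Trimmomatic directly, the two recommended settings could be expressed on the command line roughly as follows. This Python sketch only builds the argument list. Several things in it are assumptions rather than facts about this module: the jar file name, the adapter file `TruSeq3-PE.fa`, the output file naming, the preset names, and the step order (taken from the prose descriptions above, with MINLEN applied last). Trimmomatic applies steps in the order they are given.

```python
# Step strings for the two recommended settings (preset names are hypothetical).
PRESETS = {
    "hiseq_pe": [
        "ILLUMINACLIP:{adapters}:2:30:10",
        "TRAILING:20",
        "SLIDINGWINDOW:20:20",
        "MINLEN:35",
    ],
    "miseq_amplicon": [
        "TRAILING:20",
        "SLIDINGWINDOW:50:20",
        "MINLEN:50",
    ],
}


def build_pe_command(preset, r1, r2, prefix,
                     jar="trimmomatic-0.32.jar",   # assumed jar location
                     adapters="TruSeq3-PE.fa",     # assumed adapter file
                     phred=33):
    """Build a paired-end Trimmomatic argument list for subprocess.run()."""
    if phred not in (33, 64):
        raise ValueError("phred must be 33 or 64")
    steps = [step.format(adapters=adapters) for step in PRESETS[preset]]
    return [
        "java", "-jar", jar, "PE", f"-phred{phred}",
        r1, r2,
        f"{prefix}_R1_paired.fastq", f"{prefix}_R1_unpaired.fastq",
        f"{prefix}_R2_paired.fastq", f"{prefix}_R2_unpaired.fastq",
        *steps,
    ]
```

Calling `build_pe_command("miseq_amplicon", "a_R1.fastq", "a_R2.fastq", "out")` returns a list that can be passed to `subprocess.run`; keeping the arguments as a list avoids shell quoting issues with file paths.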
This analysis module uses Trimmomatic v0.32 (http://www.usadellab.org/cms/index.php?page=trimmomatic).
Reference:
Bolger, A.M., Lohse, M., & Usadel, B. (2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170.