FASTQ Summary Statistics


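  This analysis module takes raw sequencing data files in FASTQ format as input and generates a sequencing statistics report. The report can be used to draw base distribution plots and box plots of sequencing quality.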

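  To make sequencing data easier to analyze, publish, and share, the raw image data produced by Illumina sequencing are converted into sequence data by base calling, which yields sequencing data files in FASTQ format. A FASTQ file records the bases of each sequenced read together with their quality scores.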
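To illustrate how a FASTQ file stores the bases and quality scores of each read, here is a minimal sketch of a reader. It assumes the common layout of four lines per record (header, sequence, `+` separator, quality string); the names `FastqRecord` and `read_fastq` are hypothetical and are not part of this module.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # read identifier, without the leading '@'
    sequence: str  # called bases
    quality: str   # one encoded quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ text stream (assumes four lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# A single made-up record, used only for demonstration.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIHH#\n")
records = list(read_fastq(example))
```

Each record carries exactly one quality character per base, which is what allows statistics to be computed per sequencing cycle.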


  Input:

    1. A raw sequencing data file in FASTQ format (example shown in figure 15.gif).

    2. The quality score encoding of the raw data, one of: (1) Solexa, (2) Illumina 1.3-1.7, (3) Sanger/Illumina 1.8+.

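Note: for the Illumina sequencing platform, these three options correspond, in order, to very early machines, early machines, and current or later machines. The default is Sanger/Illumina 1.8+, which matches the sequencers most commonly used by sequencing providers in China today.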
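The three encodings differ in how a quality character maps to a number. The sketch below follows the conventions given in the FASTQ format description: Sanger/Illumina 1.8+ stores Phred scores with an ASCII offset of 33, Illumina 1.3-1.7 stores Phred scores with an offset of 64, and Solexa stores Solexa scores (which can be negative) with an offset of 64. The dictionary keys and function names are hypothetical and are not the module's actual option names.

```python
import math
from typing import List

# ASCII offset for each encoding; the key names are illustrative only.
OFFSETS = {
    "solexa": 64,        # Solexa scores, may be as low as -5
    "illumina1.3": 64,   # Phred scores
    "sanger": 33,        # Phred scores (also Illumina 1.8+)
}


def decode_quality(quality: str, encoding: str = "sanger") -> List[int]:
    """Convert a quality string into integer scores for the given encoding."""
    offset = OFFSETS[encoding]
    return [ord(char) - offset for char in quality]


def solexa_to_phred(q_solexa: float) -> float:
    """Convert a Solexa score to the equivalent Phred score."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


sanger_scores = decode_quality("I#", "sanger")   # 'I' is 73, '#' is 35
solexa_scores = decode_quality("h;", "solexa")   # 'h' is 104, ';' is 59
```

Choosing the wrong encoding shifts every score by 31, so implausible minimum or maximum values in the report are a useful hint that this parameter needs to be changed.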

For details on the FASTQ format, see Wikipedia: https://en.wikipedia.org/wiki/FASTQ_format


  Output:

  A table of base counts and quality score statistics over all reads, one row per sequencing cycle:

column  count    min  max  sum        mean   Q1  med  Q3  IQR  lW  rW  A_Count  C_Count  G_Count  T_Count  N_Count
1       6362991  -4   40   250734117  39.41  40  40   40  0    40  40  1396976  1329101  678730   2958184  0
2       6362991  -5   40   250531036  39.37  40  40   40  0    40  40  1786786  1055766  1738025  1782414  0
3       6362991  -5   40   248722469  39.09  40  40   40  0    40  40  2296384  984875   1443989  1637743  0
4       6362991  -4   40   248214827  39.01  40  40   40  0    40  40  2536861  1167423  1248968  1409739  0
36      6362991  -5   40   117158566  18.41  7   15   30  23   -5  40  4074444  1402980  63287    822035   245
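As a quick way to work with this table, the following sketch reads one whitespace-separated row into named fields and derives the per-cycle base fractions that a base distribution plot would show. The field names are taken from the column list in this document; `parse_row` and `base_fractions` are hypothetical helpers, not part of the module.

```python
from typing import Dict, Union

Number = Union[int, float]

# Field names in left-to-right order, as listed in this document.
FIELDS = ["column", "count", "min", "max", "sum", "mean", "Q1", "med", "Q3",
          "IQR", "lW", "rW", "A_Count", "C_Count", "G_Count", "T_Count", "N_Count"]


def _number(text: str) -> Number:
    """Parse an integer if possible, otherwise a float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_row(line: str) -> Dict[str, Number]:
    """Split one report row on whitespace and label each value."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return {name: _number(value) for name, value in zip(FIELDS, values)}


def base_fractions(row: Dict[str, Number]) -> Dict[str, float]:
    """Fraction of A, C, G, T and N calls in this cycle."""
    return {base: row[f"{base}_Count"] / row["count"] for base in "ACGTN"}


# First row of the sample output above.
row = parse_row("1 6362991 -4 40 250734117 39.41 40 40 40 0 40 40 "
                "1396976 1329101 678730 2958184 0")
fractions = base_fractions(row)
nucleotide_total = sum(row[f"{base}_Count"] for base in "ACGTN")
```

Two consistency checks follow directly from the column meanings: the five nucleotide counts add up to `count`, and `sum / count` reproduces `mean` to two decimal places.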


The columns, from left to right, have the following meanings:

column = column number (1 to 36 for a 36-cycle read file)

count = number of bases found in this column.

min = Lowest quality score value found in this column.

max = Highest quality score value found in this column.

sum = Sum of quality score values for this column.

mean = Mean quality score value for this column.

Q1 = 1st quartile quality score.

med = Median quality score.

Q3 = 3rd quartile quality score.

IQR = Inter-Quartile range (Q3-Q1).

lW = 'Left-Whisker' value (for boxplotting).

rW = 'Right-Whisker' value (for boxplotting).

A_Count = Count of 'A' nucleotides found in this column.

C_Count = Count of 'C' nucleotides found in this column.

G_Count = Count of 'G' nucleotides found in this column.

T_Count = Count of 'T' nucleotides found in this column.

N_Count = Count of 'N' nucleotides found in this column.
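The column definitions above can be sketched as a small function that summarises one sequencing cycle. Two conventions here are assumptions, since this document does not state how the module computes them: quartiles use the nearest-rank method, and the whiskers are the most extreme observed scores within 1.5 × IQR of Q1 and Q3. The sample rows shown earlier are consistent with that whisker rule, but that does not prove it is the rule the module uses. The function name `column_stats` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Union


def column_stats(bases: str, quals: List[int]) -> Dict[str, Union[int, float]]:
    """Summarise one cycle; `bases` and `quals` hold one entry per read."""
    if not quals or len(bases) != len(quals):
        raise ValueError("need one quality score per base, and at least one read")
    ordered = sorted(quals)
    n = len(ordered)

    def nearest_rank(p: float) -> int:
        # Assumed quartile convention: nearest-rank percentile.
        return ordered[max(0, math.ceil(p * n) - 1)]

    q1, med, q3 = nearest_rank(0.25), nearest_rank(0.5), nearest_rank(0.75)
    iqr = q3 - q1
    # Assumed whisker convention: extreme observed values inside the 1.5 * IQR fences.
    left_whisker = min(q for q in ordered if q >= q1 - 1.5 * iqr)
    right_whisker = max(q for q in ordered if q <= q3 + 1.5 * iqr)
    counts = Counter(bases.upper())
    total = sum(ordered)
    stats = {
        "count": n, "min": ordered[0], "max": ordered[-1],
        "sum": total, "mean": total / n,
        "Q1": q1, "med": med, "Q3": q3, "IQR": iqr,
        "lW": left_whisker, "rW": right_whisker,
    }
    for base in "ACGTN":
        stats[f"{base}_Count"] = counts.get(base, 0)
    return stats


# Made-up data: eight reads at one cycle, one of them a low-quality N call.
stats = column_stats("ACGTACGN", [40, 40, 40, 40, 40, 40, 40, 2])
```

In this example the single low score sets `min` to 2 while `lW` stays at 40, the same pattern seen in the first rows of the sample report, where `min` is negative but both whiskers equal 40.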


