Shared by Wu Tingting at Macrogen (千年基因) http://blog.sciencenet.cn/u/alinatingting /NGS/next generation sequencing/PacBio RS II sequencing/


Glossary of PacBio RS II Reads Analysis Terms (Part 3)

16667 views 2015-3-25 11:32 | Category: PacBio RS II platform data analysis | System category: Research notes | Keywords: Scholar | Analysis, primary, reads, PacBio, analys

 

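   In response to questions researchers have raised about third-generation sequencing data analysis, this post covers the basic analysis reports: sequencing data statistics, quality (QC) assessment, the data summary, and so on. The terms are explained below, with reference to project cases: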



General - Filtering Report


* Polymerase Read Bases : The number of bases in the polymerase read.

That is, the total amount of sequence data obtained, including adaptor sequences.

* Polymerase Reads : The number of polymerases generating high quality reads. Polymerase reads are trimmed to the high quality region and include bases from adaptors, as well as potentially multiple passes around a circular template.

That is, the high-quality sequencing reads. Each read includes adaptors and, where the template was read more than once, multiple subreads.

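* Polymerase Read N50 : 50% of all bases come from polymerase reads longer than this value.

N50 is weighted by bases rather than by read count: the polymerase reads of at least this length together contain half of all sequenced bases.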

* Polymerase Read Length : The mean trimmed read length of all polymerase reads. The value includes bases from adaptors as well as multiple passes around a circular template.

The mean length of the sequencing reads, including adaptors and multiple subreads.

* Polymerase Read Quality : The mean single-pass read quality of all polymerase reads.

The mean single-pass read quality of the sequencing reads.

* Post-Filter Polymerase Read Bases : The number of bases in the polymerase reads after filtering, including adaptors.

The number of bases in the sequencing reads after filtering, including adaptors and multiple subreads.

* Post-Filter Polymerase Reads : The number of polymerases generating trimmed reads after filtering. Polymerase reads include bases from adaptors and multiple passes around a circular template.

The number of sequencing reads after filtering. The filtered reads still include adaptors and multiple subreads.

* Post-Filter Polymerase Read Length : The mean trimmed read length of all polymerase reads after filtering. The value includes bases from adaptors as well as multiple passes around a circular template.

The mean length of the sequencing reads after filtering. The filtered reads still include adaptors and multiple subreads.

* Post-Filter Polymerase Read Quality : The mean single-pass read quality of all polymerase reads after filtering.

The mean single-pass read quality of the sequencing reads after filtering.
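
To make the relationship between the pre-filter and post-filter figures concrete, here is a minimal Python sketch that computes them from a list of (length, quality) pairs. The function names, the toy data, and the filter thresholds (`min_length`, `min_quality`) are assumptions made for this illustration; they are not taken from the SMRT Analysis software.

```python
from statistics import mean


def n50(lengths):
    """Return L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # no reads


def summarize(reads):
    """reads: list of (length, quality) pairs, one per polymerase read."""
    lengths = [length for length, _ in reads]
    qualities = [quality for _, quality in reads]
    return {
        "bases": sum(lengths),
        "reads": len(reads),
        "n50": n50(lengths),
        "mean_length": mean(lengths) if lengths else 0,
        "mean_quality": mean(qualities) if qualities else 0,
    }


def filter_reads(reads, min_length=50, min_quality=0.75):
    # Hypothetical thresholds, used here only to show the filtering step.
    return [(l, q) for l, q in reads if l >= min_length and q >= min_quality]


# Toy data: (polymerase read length, read quality)
reads = [(12000, 0.86), (8000, 0.84), (500, 0.80), (30, 0.30)]
pre = summarize(reads)                 # "Polymerase Read ..." values
post = summarize(filter_reads(reads))  # "Post-Filter Polymerase Read ..." values
print(pre)
print(post)
```

In this toy data set the short, low-quality read is dropped by the filter, so the post-filter read count and base count are lower and the mean length is higher.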


Appendix: terms used in the other output reports

Diagnostic - Adapters Report

  • Adapter Dimers (%): The % of pre-filter ZMWs which have observed inserts of 0-10bp. These are likely adapter dimers.

    Adapter dimers (%): the percentage of ZMWs, before filtering, whose observed insert is 0-10 bp. These are very likely adapter dimers.

  • Short Inserts (%): The % of pre-filter ZMWs which have observed inserts of 11-100bp. These are likely short fragment contamination.

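    Short inserts (%): the percentage of ZMWs, before filtering, whose observed insert is 11-100 bp. These are very likely short contaminating fragments.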
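
The two percentages above are simple length bins over the observed insert lengths. A minimal sketch, assuming one observed insert length per pre-filter ZMW is already available (the function name and the toy data are hypothetical):

```python
def adapter_report(insert_lengths):
    """insert_lengths: observed insert length (bp) for each pre-filter ZMW."""
    n = len(insert_lengths)
    if n == 0:
        return {"adapter_dimers_pct": 0.0, "short_inserts_pct": 0.0}
    dimers = sum(1 for x in insert_lengths if 0 <= x <= 10)    # likely adapter dimers
    short = sum(1 for x in insert_lengths if 11 <= x <= 100)   # likely short fragments
    return {
        "adapter_dimers_pct": 100 * dimers / n,
        "short_inserts_pct": 100 * short / n,
    }


# Toy data: ten ZMWs
report = adapter_report([0, 5, 10, 11, 100, 101, 300, 5000, 8000, 12000])
print(report)
```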

Diagnostic - Spike-In Control Report

  • Control Sequence: The name of the control sequence.

    Information identifying the control sequence/sample.

  • Control Reads (%): The percent of post-filter polymerase reads that are from the control sample. The formula for this is: (total # of control reads)/(total # of post-filter reads).

    The proportion of post-filter reads that are control reads. The formula is: (total # of control reads)/(total # of post-filter reads).

  • Control Polymerase Read Length: The mean mapped read length of the polymerase reads from the control sample.

    The mean length of the mapped reads among the sequencing reads from the control sample.

  • Control Reads: The total number of polymerase reads from the control sample that passed filtering.

    The total number of sequencing reads from the control sample after filtering.

  • Control Subread Accuracy: The mean single-pass accuracy of the mapped polymerase reads from the control sample.

    The mean single-pass accuracy of the mapped sequencing reads from the control sample.

  • Control Polymerase Read Length 95%: The 95th percentile of mapped read length of the polymerase reads from the control sample.

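    The 95th percentile of the mapped read lengths in the control sample: 95% of the mapped control reads are no longer than this value. (It is a length percentile, not a mapping rate.)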
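
The Control Reads (%) formula and the 95th-percentile length can be sketched as follows. This is an illustration only: the function names are hypothetical, and the nearest-rank percentile used here is one of several conventions, so it may differ slightly from what the reporting software computes.

```python
import math


def control_reads_pct(n_control_reads, n_post_filter_reads):
    """(total # of control reads) / (total # of post-filter reads), as a percentage."""
    return 100 * n_control_reads / n_post_filter_reads


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile (an assumed convention, not necessarily PacBio's)."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Toy data: 150 control reads out of 50,000 post-filter reads
pct = control_reads_pct(150, 50000)

# Toy mapped read lengths 1000, 2000, ..., 20000
mapped_lengths = [1000 * i for i in range(1, 21)]
p95 = percentile_nearest_rank(mapped_lengths, 95)
print(pct, p95)
```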

Diagnostic - Loading Report

  • SMRT Cell ID: ID number of the SMRT Cell(s) used in this run.

    The ID number(s) of the SMRT Cell(s) used in this run.

  • Productive ZMWs: The number of ZMWs for this SMRT Cell that produced results with Productivity = 1.

    The number of zero-mode waveguides (ZMWs) in this SMRT Cell that produced sequence, i.e. those with Productivity = 1.

  • Productivity 0 (%): Percentage of ZMWs that are empty, with no polymerase.

    The ZMW is empty: no polymerase was loaded into it.

  • Productivity 1 (%): Percentage of ZMWs that are productive and sequencing.

    The ZMW is loaded with polymerase and is sequencing.

  • Productivity 2 (%): Percentage of ZMWs that are not P0 (empty) or P1 (productive). This may occur for a variety of reasons and the sequence data is not usable.

    The ZMW is neither P0 (empty) nor P1 (productive). This can have many causes, and the sequence data is not usable.
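
The Loading Report figures are counts and percentages over the per-ZMW productivity class. A minimal sketch, assuming each ZMW has already been assigned a class of 0, 1, or 2 (the function name and input layout are hypothetical):

```python
from collections import Counter


def loading_report(productivity):
    """productivity: one value per ZMW, each 0 (empty), 1 (productive) or 2 (other)."""
    counts = Counter(productivity)
    total = len(productivity)
    report = {"productive_zmws": counts[1]}
    for p in (0, 1, 2):
        report[f"productivity_{p}_pct"] = 100 * counts[p] / total if total else 0.0
    return report


# Toy SMRT Cell with ten ZMWs
loading = loading_report([1, 1, 1, 1, 1, 1, 0, 0, 0, 2])
print(loading)
```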

Resequencing - Coverage Report

  • Coverage: The mean depth of coverage across the reference sequence.

    The mean coverage (mean sequencing depth) of the total sequence data relative to the reference genome sequence.

  • Missing Bases (%): The percentage of the reference sequence that has zero coverage.

    The percentage of the reference genome sequence that is not covered at all, i.e. where the sequencing depth is 0.
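
Both values follow from the per-base depth along the reference. The sketch below builds that depth from alignment intervals and then derives the two report values. The interval convention (0-based start, exclusive end) and the function names are assumptions for this example.

```python
def depth_from_alignments(ref_length, alignments):
    """alignments: (start, end) pairs on the reference, 0-based, end-exclusive."""
    diff = [0] * (ref_length + 1)
    for start, end in alignments:
        diff[start] += 1   # coverage rises where an alignment begins
        diff[end] -= 1     # and falls just past where it ends
    depth, running = [], 0
    for delta in diff[:ref_length]:
        running += delta
        depth.append(running)
    return depth


def coverage_report(depth):
    n = len(depth)
    missing = sum(1 for d in depth if d == 0)
    return {
        "coverage": sum(depth) / n,               # mean depth of coverage
        "missing_bases_pct": 100 * missing / n,   # positions with zero coverage
    }


# Toy reference of 10 bp with three mapped reads
depth = depth_from_alignments(10, [(0, 6), (2, 8), (4, 8)])
cov = coverage_report(depth)
print(depth, cov)
```

With this definition, 100 minus `missing_bases_pct` is the share of the reference with at least 1x coverage, which is what the Variants Report calls Bases Called (%).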

Resequencing - Mapping Report

  • Post-Filter Reads: The number of reads that passed filtering.

    The number of reads after filtering.

  • Mapped Reads: The number of post-filter reads that mapped to the reference sequence.

    The number of post-filter reads that can be aligned to the reference genome sequence.

  • Mapped Subreads: The number of post-filter subreads that mapped to the reference sequence.

    The number of post-filter subreads that can be aligned to the reference genome sequence.

  • Mapped CCS Reads: The number of post-filter CCS reads that mapped to the reference sequence.

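    CCS stands for circular consensus sequence; it is built by aligning the subreads that come from the same ZMW.

    Here it means the number of post-filter CCS reads that can be aligned to the reference genome sequence.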

  • Mapped Subread Bases: The number of post-filter bases from all subreads that mapped to the reference sequence. This does not include adapters.

    The total number of bases in the post-filter subreads that can be aligned to the reference genome sequence. Adapters are not included.

  • Mapped CCS Read Bases: The number of post-filter CCS read bases that mapped to the reference sequence. This does not include adapters.

    The total number of bases in the post-filter CCS reads that can be aligned to the reference genome sequence. Adapters are not included.

  • Mapped Subread Accuracy: The mean accuracy of post-filter subreads that mapped to the reference sequence.

    The mean accuracy of the post-filter subreads that can be aligned to the reference genome sequence.

  • Mapped CCS Read Accuracy: The mean accuracy of post-filter CCS reads that mapped to the reference sequence.

    The mean accuracy of the post-filter CCS reads that can be aligned to the reference genome sequence.

  • Mapped Subread Length: The mean read length of post-filter subreads that mapped to the reference sequence. This does not include adapters.

    The mean length of the post-filter subreads that can be aligned to the reference genome sequence. Adapters are not included.

  • Mapped Read Length of Insert: The mean read length of all insert sequences, which includes only mapped sequences. The read length of insert is approximately the longest subread length per ZMW.

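    The mean length of all post-filter insert sequences that can be aligned to the reference genome sequence. Within one ZMW, the insert length is approximately the length of the longest subread from that ZMW.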

  • Mapped Polymerase Read Length: The mean read length of post-filter polymerase reads that mapped to the reference sequence. This includes adapters.

    The mean length of the post-filter sequencing reads that can be aligned to the reference genome sequence. Polymerase reads include adapters.

  • Mapped Polymerase Read Length 95%: The 95th percentile of read length of post-filter polymerase reads that mapped to the reference sequence.

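    The 95th percentile of the lengths of the post-filter polymerase reads that can be aligned to the reference genome sequence: 95% of those reads are no longer than this value. (It is a length percentile, not a mapping rate.)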

  • Mapped Polymerase Read Length Max: The maximum read length of post-filter polymerase reads that mapped to the reference sequence.

    The length of the longest post-filter polymerase read that can be aligned to the reference genome sequence.

  • Mapped Full Subread Length: The average of the lengths of full subreads that mapped to the reference sequence. Full subreads are subreads flanked by two adapters.

    The mean length of the post-filter full subreads that aligned to the reference genome sequence. A full subread has an adapter on both sides.
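
The difference between Mapped Subread Length and Mapped Read Length of Insert can be illustrated with a few lines of Python. The sketch takes the longest mapped subread per ZMW as the read length of insert, following the approximation stated above; the input layout (pairs of ZMW ID and subread length) and the function name are assumptions for this example.

```python
from collections import defaultdict
from statistics import mean


def mapping_lengths(mapped_subreads):
    """mapped_subreads: (zmw_id, subread_length) pairs for mapped subreads."""
    longest = defaultdict(int)
    for zmw, length in mapped_subreads:
        # Approximation: one insert per ZMW, as long as its longest subread.
        longest[zmw] = max(longest[zmw], length)
    return {
        "mapped_subreads": len(mapped_subreads),
        "mapped_subread_bases": sum(length for _, length in mapped_subreads),
        "mapped_subread_length": mean([length for _, length in mapped_subreads]),
        "mapped_read_length_of_insert": mean(list(longest.values())),
    }


# Toy data: three ZMWs, six mapped subreads
subreads = [(1, 1000), (1, 1200), (1, 400), (2, 3000), (3, 500), (3, 700)]
mapping = mapping_lengths(subreads)
print(mapping)
```

Because short partial subreads pull the subread mean down, the read length of insert is usually the larger of the two values.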

Analysis - Variants Report

  • Reference: The name of the reference sequence.

  • Reference Length: The length of the reference sequence.

  • Bases Called (%): The percentage of reference sequence that has ≥ 1x coverage. % Bases Called + % Missing Bases should equal 100.

  • Consensus Accuracy: The accuracy of the consensus sequence compared to the reference.

  • Base Coverage: The mean depth of coverage across the reference sequence.

Analysis - Top Variants Report

  • Sequence: The name of the reference sequence.

  • Position: The position of the variant along the reference sequence.

  • Variant: The variant position, type, and affected nucleotide.

  • Type: The variant type: Insertion, Deletion, or Substitution.

  • Coverage: The coverage at position.

  • Confidence: The confidence of the variant call.

  • Genotype: Includes the full number of chromosomes (diploid) or half the number (haploid).

Assembly - Iterations Report

  • Assembly Iterations: The number of iterations of overlap-layout-consensus performed by the de novo or hybrid assembly algorithm.

Assembly - Draft Assembly Report

  • Draft Contigs: The number of contigs output by Celera Assembler, which may include singleton and degenerate contigs. After assembly polishing with Quiver, the final number of contigs may be smaller.

  • N50 Contig Length: The length L such that 50% of all bases in the final assembly are in contigs of length L or greater.

  • Reads Assembled (%): The fraction of all reads that are assembled into contigs in the final assembly.

  • Max Contig Length: The length of the longest contig in the final assembly.

  • Sum of Contig Lengths: The sum of the lengths of all contigs in the final assembly.
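
The length-based values in this report can be derived from the list of contig lengths alone. A minimal sketch, using the usual base-weighted N50 definition (the function name and the toy lengths are hypothetical):

```python
def contig_stats(contig_lengths):
    """Summary statistics over a list of contig lengths (bp)."""
    ordered = sorted(contig_lengths, reverse=True)
    total = sum(ordered)
    running, n50 = 0, 0
    for length in ordered:
        running += length
        if 2 * running >= total:   # contigs this long or longer hold half the bases
            n50 = length
            break
    return {
        "draft_contigs": len(ordered),
        "n50_contig_length": n50,
        "max_contig_length": ordered[0] if ordered else 0,
        "sum_of_contig_lengths": total,
    }


# Toy assembly with six contigs
stats = contig_stats([80, 70, 50, 40, 30, 20])
print(stats)
```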

Hybrid Assembly - Assembly Iterations Report

  • Input Contigs: The number of contigs used as input to the AHA algorithm.

  • Min Align Score: The minimum alignment score between a read and a contig to use the alignment for scaffolding.

  • Min Link Redundancy: The minimum number of reads that must link two contigs for those contigs to be connected in a scaffold.

  • Min Subread Length: The minimum length required for a subread to be used by the AHA algorithm.

  • Min Contig Length: The minimum length required for a contig to be used by the AHA algorithm.

  • Scaffolds Across Assembly Iterations: The number of scaffolds at a particular iteration of the AHA algorithm.

  • Linking Reads Across Assembly Iterations: The number of linking reads at a particular iteration of the AHA algorithm.
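
How Min Align Score and Min Link Redundancy act together can be shown with a simplified filter. This is not the AHA algorithm itself: the sketch assumes each record already describes one read that aligns to two contigs with a single score, and all names are hypothetical.

```python
from collections import defaultdict


def scaffold_links(links, min_align_score, min_link_redundancy):
    """links: (read_id, contig_a, contig_b, score) tuples (assumed layout)."""
    support = defaultdict(set)
    for read_id, a, b, score in links:
        if score >= min_align_score:                    # drop weak alignments
            support[tuple(sorted((a, b)))].add(read_id)
    # Keep only contig pairs supported by enough distinct linking reads.
    return {
        pair: len(reads)
        for pair, reads in support.items()
        if len(reads) >= min_link_redundancy
    }


links = [
    ("r1", "c1", "c2", 80),
    ("r2", "c2", "c1", 75),
    ("r3", "c1", "c2", 20),   # below the score threshold
    ("r4", "c2", "c3", 90),   # only one read links c2 and c3
]
kept = scaffold_links(links, min_align_score=50, min_link_redundancy=2)
print(kept)
```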

Hybrid Assembly - Final Assembly Report

  • Number: The number of scaffolds, contigs, or gaps in the initial or final assembly.

  • Max Length: The length of the longest scaffold, contig, or gap in the initial or final assembly.

  • N50 Length: The length L such that 50% of all bases in the initial/final scaffolds, contigs, or gaps are in scaffolds, contigs, or gaps of length L or greater.

  • Sum Length: The sum of the lengths of all scaffolds, contigs, or gaps in the initial or final assembly.

  • Initial Scaffolds: The distribution of the lengths of the scaffold sequences before completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences.

  • Final Scaffolds: The distribution of the lengths of the scaffold sequences after completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences.

  • Initial Contigs: The distribution of the lengths of the contig sequences before completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.

  • Final Contigs: The distribution of the lengths of the contig sequences after completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.

  • Initial Gaps: The distribution of the lengths of the gaps between contig sequences before completing the AHA algorithm.

  • Final Gaps: The distribution of the lengths of the gaps between contig sequences after completing the AHA algorithm.
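
The relationship between scaffolds, contigs, and gaps can be sketched by splitting a scaffold sequence into its parts. The sketch assumes gaps are written as runs of `N` in the scaffold sequence, which is a common FASTA convention but is an assumption here; the function name is hypothetical.

```python
import re


def split_scaffold(sequence):
    """Return (contig_lengths, gap_lengths) for one scaffold sequence."""
    contigs = [len(part) for part in re.findall(r"[^Nn]+", sequence)]
    gaps = [len(part) for part in re.findall(r"[Nn]+", sequence)]
    return contigs, gaps


# Toy scaffold: three contigs separated by two gaps
scaffold = "ACGT" + "N" * 5 + "GG" + "N" * 2 + "TTTAAA"
contig_lengths, gap_lengths = split_scaffold(scaffold)
print(len(scaffold), contig_lengths, gap_lengths)
```

The scaffold length equals the sum of its contig and gap lengths, which is why the report lists Number, Max Length, N50 Length, and Sum Length separately for each of the three.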

Base Modifications - Motifs Report

  • Motif: The nucleotide sequence of the methyltransferase recognition motif, using the standard IUPAC nucleotide alphabet.

  • Modified Position: The position within the motif that is modified. The first base is "1". Example: The modified adenine in GATC is at position 2.

  • Modification Type: The type of chemical modification most commonly identified at that motif. These are: 6mA, 4mC, 5mC, or modified_base (modification not recognized by the software).

  • % Motifs Detected: The percentage of times that this motif was detected as modified across the entire genome.

  • # Of Motifs Detected: The number of times that this motif was detected as modified across the entire genome.

  • # Of Motifs In Genome: The number of times this motif occurs in the genome.

  • Mean Modification QV: The mean modification QV for all instances where this motif was detected as modified.

  • Mean Motif Coverage: The mean coverage for all instances where this motif was detected as modified.

  • Partner Motif: For motifs that are not self-palindromic, this is the complementary sequence.
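
Several of these columns can be reproduced with plain string handling. The sketch below expands an IUPAC motif into a regular expression, counts its occurrences on one strand, and derives the partner motif as the reverse complement. The function names are hypothetical, and counting on the forward strand only is a simplification made for this example.

```python
import re

# IUPAC nucleotide codes expanded to regex character classes
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def partner_motif(motif):
    """Reverse complement of the motif, or None if the motif is palindromic."""
    rc = motif.translate(COMPLEMENT)[::-1]
    return None if rc == motif else rc


def count_motif(genome, motif):
    """Count (possibly overlapping) occurrences on the given strand only."""
    pattern = "".join(IUPAC[base] for base in motif)
    return len(re.findall(f"(?=({pattern}))", genome))


def pct_motifs_detected(n_detected, n_in_genome):
    return 100 * n_detected / n_in_genome


# Modified Position is 1-based: position 2 of GATC is the adenine.
modified_base = "GATC"[2 - 1]
print(modified_base, partner_motif("GATC"), partner_motif("GGTCTC"))
print(count_motif("GATCGGTCAATT", "GRTC"))
```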

Assembly - Pre-Assembly Report

  • Seed Bases: The number of bases from seed reads.

  • Pre-Assembled Yield: The percentage of seed read bases that were successfully aligned to generate pre-assembled reads.

  • Pre-Assembled Read Length: The average length of the pre-assembled reads.

  • Length Cutoff: Reads with lengths greater than the length cutoff are used as seed reads for pre-assembly.

  • Pre-Assembled Bases: The number of bases in the pre-assembled reads.

  • Pre-Assembled Reads: The number of reads output by the pre-assembler. Pre-assembled reads are very long, highly accurate reads that can be used as input to a de novo assembler.

  • Pre-Assembled N50: The N50 read length of the pre-assembled reads.
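
The pre-assembly figures are linked through the length cutoff. The sketch below selects seed reads and computes the yield as pre-assembled bases divided by seed bases. That ratio is an approximation assumed for this illustration, and the function name and toy lengths are hypothetical.

```python
from statistics import mean


def preassembly_report(read_lengths, preassembled_lengths, length_cutoff):
    # Reads longer than the cutoff serve as seed reads.
    seeds = [length for length in read_lengths if length > length_cutoff]
    seed_bases = sum(seeds)
    pre_bases = sum(preassembled_lengths)
    return {
        "length_cutoff": length_cutoff,
        "seed_bases": seed_bases,
        "pre_assembled_bases": pre_bases,
        "pre_assembled_reads": len(preassembled_lengths),
        "pre_assembled_read_length": mean(preassembled_lengths) if preassembled_lengths else 0,
        # Assumed approximation: share of seed bases recovered in pre-assembled reads.
        "pre_assembled_yield_pct": 100 * pre_bases / seed_bases if seed_bases else 0.0,
    }


# Toy data: five input reads, three pre-assembled reads
pre_report = preassembly_report(
    read_lengths=[12000, 9000, 7000, 3000, 1500],
    preassembled_lengths=[10500, 8000, 5500],
    length_cutoff=6000,
)
print(pre_report)
```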


To be continued.



https://m.sciencenet.cn/blog-1333578-877124.html
