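Next-generation sequencing produces millions of short reads, which are usually aligned to a reference genome. In many applications, the key information needed for downstream analysis is the number of reads mapping to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is required for a wide variety of genomic analyses, but has so far received relatively little attention in the literature.

featureCounts is a read summarization program suitable for counting reads generated from either RNA or DNA sequencing experiments. It implements highly efficient chromosome hashing and feature blocking techniques. It is considerably faster than existing methods (by an order of magnitude for gene-level summarization) and requires far less computer memory. It works with either single-end or paired-end reads and provides a range of options appropriate for different sequencing applications.

Download: http://www.sourceforge.net/projects/subread http://www.bioconductor.org/pa ... .html
Language: R
Date: 20131123
Reference: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features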
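To make the idea of read summarization concrete, here is a minimal Python sketch that hashes features by chromosome, groups them into fixed-width blocks, and counts reads per gene. It is only an illustration of the general approach, not featureCounts' actual implementation: the block width `BIN_SIZE`, the tuple layouts, and the rule of skipping reads that hit zero or several genes are all assumptions made for this example.

```python
from collections import defaultdict

BIN_SIZE = 100_000  # hypothetical block width, chosen only for illustration


def build_index(features):
    """Index (gene, chrom, start, end) features by chromosome, then by block."""
    index = defaultdict(lambda: defaultdict(list))
    for gene, chrom, start, end in features:
        for b in range(start // BIN_SIZE, end // BIN_SIZE + 1):
            index[chrom][b].append((start, end, gene))
    return index


def count_reads(index, reads):
    """Count reads per gene; reads overlapping no gene or several genes are skipped."""
    counts = defaultdict(int)
    for chrom, r_start, r_end in reads:
        hits = set()
        blocks = index.get(chrom, {})
        for b in range(r_start // BIN_SIZE, r_end // BIN_SIZE + 1):
            for f_start, f_end, gene in blocks.get(b, ()):
                # Closed intervals overlap when each starts before the other ends.
                if r_start <= f_end and f_start <= r_end:
                    hits.add(gene)
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return dict(counts)


features = [
    ("geneA", "chr1", 100, 500),
    ("geneA", "chr1", 800, 1200),
    ("geneB", "chr1", 150_000, 151_000),
]
reads = [
    ("chr1", 450, 520),          # overlaps the first geneA exon
    ("chr1", 1190, 1260),        # overlaps the second geneA exon
    ("chr1", 150_500, 150_570),  # inside geneB
    ("chr1", 600, 670),          # intergenic, not counted
    ("chr2", 100, 170),          # chromosome without features, not counted
]
example_counts = count_reads(build_index(features), reads)
```

Only the blocks a read touches are scanned, which is why the lookup cost does not grow with the total number of features on the chromosome.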
In response to questions from researchers about third-generation sequencing data analysis, this post covers the basic-information part of the analysis: sequencing data statistics, quality control (QC) assessment, and the data summary. The terms are explained below with reference to project reports.

General - Filtering Report
* Polymerase Read Bases : The number of bases in the polymerase read. This is the total amount of data sequenced, including adapter sequences.
* Polymerase Reads : The number of polymerases generating high quality reads. Polymerase reads are trimmed to the high quality region and include bases from adapters, as well as potentially multiple passes around a circular template (that is, multiple subreads).
* Polymerase Read N50 : 50% of all bases are in polymerase reads longer than this value.
* Polymerase Read Length : The mean trimmed read length of all polymerase reads. The value includes bases from adapters as well as multiple passes around a circular template.
* Polymerase Read Quality : The mean single-pass read quality of all polymerase reads.
* Post-Filter Polymerase Read Bases : The number of bases in the polymerase reads after filtering, including adapters and multiple subreads.
* Post-Filter Polymerase Reads : The number of polymerases generating trimmed reads after filtering. Polymerase reads include bases from adapters and multiple passes around a circular template.
* Post-Filter Polymerase Read Length : The mean trimmed read length of all polymerase reads after filtering. The value includes bases from adapters as well as multiple passes around a circular template.
* Post-Filter Polymerase Read Quality : The mean single-pass read quality of all polymerase reads after filtering.

Definitions of terms used in the other output reports follow.

Diagnostic - Adapters Report
Adapter Dimers (%) : The % of pre-filter ZMWs which have observed inserts of 0-10 bp. These are likely adapter dimers.
Short Inserts (%) : The % of pre-filter ZMWs which have observed inserts of 11-100 bp. These are likely short fragment contamination.

Diagnostic - Spike-In Control Report
Control Sequence : The name of the control sequence.
Control Reads (%) : The percent of post-filter polymerase reads that are from the control sample. The formula for this is: (total # of control reads)/(total # of post-filter reads).
Control Polymerase Read Length : The mean mapped read length of the polymerase reads from the control sample.
Control Reads : The total number of polymerase reads from the control sample that passed filtering.
Control Subread Accuracy : The mean single-pass accuracy of the mapped polymerase reads from the control sample.
Control Polymerase Read Length 95% : The 95th percentile of mapped read length of the polymerase reads from the control sample; that is, 95% of the mapped control reads are no longer than this value.

Diagnostic - Loading Report
SMRT Cell ID : ID number of the SMRT Cell(s) used in this run.
Productive ZMWs : The number of ZMWs for this SMRT Cell that produced results with Productivity = 1.
Productivity 0 (%) : Percentage of ZMWs that are empty, with no polymerase.
Productivity 1 (%) : Percentage of ZMWs that are productive and sequencing, that is, loaded with a polymerase.
Productivity 2 (%) : Percentage of ZMWs that are not P0 (empty) or P1 (productive). This may occur for a variety of reasons and the sequence data is not usable.

Resequencing - Coverage Report
Coverage : The mean depth of coverage across the reference sequence.
Missing Bases (%) : The percentage of the reference sequence that has zero coverage.

Resequencing - Mapping Report
Post-Filter Reads : The number of reads that passed filtering.
Mapped Reads : The number of post-filter reads that mapped to the reference sequence.
Mapped Subreads : The number of post-filter subreads that mapped to the reference sequence.
Mapped CCS Reads : The number of post-filter CCS reads that mapped to the reference sequence. A CCS (circular consensus sequence) read is the consensus built by aligning the subreads from a single ZMW.
Mapped Subread Bases : The number of post-filter bases from all subreads that mapped to the reference sequence. This does not include adapters.
Mapped CCS Read Bases : The number of post-filter CCS read bases that mapped to the reference sequence. This does not include adapters.
Mapped Subread Accuracy : The mean accuracy of post-filter subreads that mapped to the reference sequence.
Mapped CCS Read Accuracy : The mean accuracy of post-filter CCS reads that mapped to the reference sequence.
Mapped Subread Length : The mean read length of post-filter subreads that mapped to the reference sequence. This does not include adapters.
Mapped Read Length of Insert : The mean read length of all insert sequences, which includes only mapped sequences. The read length of insert is approximately the longest subread length per ZMW.
Mapped Polymerase Read Length : The mean read length of post-filter polymerase reads that mapped to the reference sequence. This includes adapters.
Mapped Polymerase Read Length 95% : The 95th percentile of read length of post-filter polymerase reads that mapped to the reference sequence; that is, 95% of these reads are no longer than this value.
Mapped Polymerase Read Length Max : The maximum read length of post-filter polymerase reads that mapped to the reference sequence.
Mapped Full Subread Length : The average of the lengths of full subreads that mapped to the reference sequence. Full subreads are subreads flanked by two adapters.

Analysis - Variants Report
Reference : The name of the reference sequence.
Reference Length : The length of the reference sequence.
Bases Called (%) : The percentage of reference sequence that has ≥ 1x coverage. % Bases Called + % Missing Bases should equal 100.
Consensus Accuracy : The accuracy of the consensus sequence compared to the reference.
Base Coverage : The mean depth of coverage across the reference sequence.

Analysis - Top Variants Report
Sequence : The name of the reference sequence.
Position : The position of the variant along the reference sequence.
Variant : The variant position, type, and affected nucleotide.
Type : The variant type: Insertion, Deletion, or Substitution.
Coverage : The coverage at position.
Confidence : The confidence of the variant call.
Genotype : Includes the full number of chromosomes (diploid) or half the number (haploid).

Assembly - Iterations Report
Assembly Iterations : The number of iterations of overlap-layout-consensus performed by the de novo or hybrid assembly algorithm.

Assembly - Draft Assembly Report
Draft Contigs : The number of contigs output by Celera Assembler, which may include singleton and degenerate contigs. After assembly polishing with Quiver, the final number of contigs may be smaller.
N50 Contig Length : The length L such that 50% of all bases in the final contigs are in contigs of length at least L.
Reads Assembled (%) : The fraction of all reads that are assembled into contigs in the final assembly.
Max Contig Length : The length of the longest contig in the final assembly.
Sum of Contig Lengths : The sum of the lengths of all contigs in the final assembly.

Hybrid Assembly - Assembly Iterations Report
Input Contigs : The number of contigs used as input to the AHA algorithm.
Min Align Score : The minimum alignment score between a read and a contig to use the alignment for scaffolding.
Min Link Redundancy : The minimum number of reads that must link two contigs for those contigs to be connected in a scaffold.
Min Subread Length : The minimum length required for a subread to be used by the AHA algorithm.
Min Contig Length : The minimum length required for a contig to be used by the AHA algorithm.
Scaffolds Across Assembly Iterations : The number of scaffolds at a particular iteration of the AHA algorithm.
Linking Reads Across Assembly Iterations : The number of linking reads at a particular iteration of the AHA algorithm.

Hybrid Assembly - Final Assembly Report
Number : The number of scaffolds, contigs, or gaps in the initial or final assembly.
Max Length : The length of the longest scaffold, contig, or gap in the initial or final assembly.
N50 Length : The length L such that 50% of all bases in the initial or final scaffolds, contigs, or gaps are in sequences of length at least L.
Sum Length : The sum of the lengths of all scaffolds, contigs, or gaps in the initial or final assembly.
Initial Scaffolds : The distribution of the lengths of the scaffold sequences before completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences.
Final Scaffolds : The distribution of the lengths of the scaffold sequences after completing the AHA algorithm.
Scaffolds are composed of contigs optionally separated by gap sequences.
Initial Contigs : The distribution of the lengths of the contig sequences before completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.
Final Contigs : The distribution of the lengths of the contig sequences after completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps.
Initial Gaps : The distribution of the lengths of the gaps between contig sequences before completing the AHA algorithm.
Final Gaps : The distribution of the lengths of the gaps between contig sequences after completing the AHA algorithm.

Base Modifications - Motifs Report
Motif : The nucleotide sequence of the methyltransferase recognition motif, using the standard IUPAC nucleotide alphabet.
Modified Position : The position within the motif that is modified. The first base is 1. Example: The modified adenine in GATC is at position 2.
Modification Type : The type of chemical modification most commonly identified at that motif. These are: 6mA, 4mC, 5mC, or modified_base (modification not recognized by the software).
% Motifs Detected : The percentage of times that this motif was detected as modified across the entire genome.
# Of Motifs Detected : The number of times that this motif was detected as modified across the entire genome.
# Of Motifs In Genome : The number of times this motif occurs in the genome.
Mean Modification QV : The mean modification QV for all instances where this motif was detected as modified.
Mean Motif Coverage : The mean coverage for all instances where this motif was detected as modified.
Partner Motif : For motifs that are not self-palindromic, this is the complementary sequence.

Assembly - Pre-Assembly Report
Seed Bases : The number of bases from seed reads.
Pre-Assembled Yield : The percentage of seed read bases that were successfully aligned to generate pre-assembled reads.
Pre-Assembled Read Length : The average length of the pre-assembled reads.
Length Cutoff : Reads with lengths greater than the length cutoff are used as seed reads for pre-assembly.
Pre-Assembled Bases : The number of bases in the pre-assembled reads.
Pre-Assembled Reads : The number of reads output by the pre-assembler. Pre-assembled reads are very long, highly accurate reads that can be used as input to a de novo assembler.
Pre-Assembled N50 : The N50 read length of the pre-assembled reads.

To be continued.
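Several of the reports above quote an N50 (Polymerase Read N50, N50 Contig Length, Pre-Assembled N50). The following Python sketch shows one common way to compute it from a list of lengths. It assumes the usual base-weighted convention, in which the N50 is the length of the sequence at which the running total, taken from longest to shortest, first reaches half of all bases; how SMRT Portal handles ties or rounding is not stated here.

```python
def n50(lengths):
    """Return the N50 of a collection of read or contig lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_of_bases = sum(ordered) / 2
    running_total = 0
    for length in ordered:
        running_total += length
        # The first length at which half of all bases are covered is the N50.
        if running_total >= half_of_bases:
            return length


contig_lengths = [100, 200, 300, 400]  # made-up lengths for illustration
contig_n50 = n50(contig_lengths)
```

With the made-up lengths above, half of the 1000 bases is reached once the 400 and 300 base contigs are included, so the N50 is 300. Note that this differs from the median length, which weights every sequence equally regardless of size.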
SMRT® Portal Help

Pacific Biosciences Terminology

General Terminology
Adapters : Hairpin loops that are ligated to both ends of the double stranded DNA insert. When adapter sequences are removed, the read is split into multiple subreads. These are the hairpin-shaped SMRTbell adapters, which are ligated to the blunt ends of the double-stranded DNA template during library construction.
Movie : Real-time observation of a SMRT Cell, that is, the period over which one SMRT Cell is observed during sequencing.
Read : A contiguous sequence generated from a ZMW that includes an insert sequence and may include an adapter sequence. A read is composed of alternating subreads and adapters.
Sequencing ZMW : A ZMW that is expected to be able to produce a sequence if it is populated with a polymerase. ZMWs used for automated SMRT Cell alignment are not considered sequencing ZMWs.
Subread : Sequence generated by splitting the raw sequence from a ZMW by the adapters. This is the post-sequencing version of the “insert DNA” used in sample preparation, that is, the target sequence.
Zero-Mode Waveguide (ZMW) : A nanophotonic device for confining light to a small observation volume that can be, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection.

Primary Analysis Terminology
Adapter Screening : Annotates adapter read locations. Used to break a read into subreads during secondary analysis mapping and Circular Consensus.
High Quality Region Screening : Annotates the high quality sequencing regions of a read to be used during Raw Read Trimming.
Insert Screening : Annotates insert DNA regions in the raw read.
Quality Value Assignment : A prediction of the error probability of a basecall.
Quality Value (QV) : The total probability that the basecall is an insertion or substitution or is preceded by a deletion, expressed on a logarithmic scale as QV = -10 * log10(p), where p is that probability.
Insertion QV : The probability that the basecall is an insertion with respect to the true sequence.
Deletion QV : The probability that a deletion error occurred before the current base.
Substitution QV : The probability that the basecall is a substitution.
Raw Read Trimming : Extraction of high quality regions from a raw read. This results in a read.
Read Quality Assignment : A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on). This is used during secondary analysis filtering.

Secondary Analysis Terminology
Consensus : Generation of a consensus sequence from multiple-sequence alignment.
De Novo Assembly : Assembly of all subreads without a reference sequence.
Filtering : Removes reads that do not meet the Read Quality and Read length parameters set by the user. The current default filtering parameters defined by Pacific Biosciences are:
* Read Quality ≥ 0.75 (as of SMRT Analysis v1.3.1)
* Read length ≥ 50 bases
Mapping : Local alignment of a read or subread to a reference sequence.

Accuracy Terminology
Circular Consensus Accuracy : Accuracy of the circular consensus read.
Consensus Accuracy : Accuracy of the consensus sequence compared to the reference.
Read Quality : A trained prediction of a read’s mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on).
Single Molecule Raw Accuracy : Accuracy based on one pass on one single molecule.
Subread Accuracy : The post-mapping accuracy of the basecalls. Formula: [1 - (errors / subread length)] * 100%, where errors = number of deletions + insertions + substitutions.
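The quality value and subread accuracy definitions above are simple arithmetic, sketched here in Python. The QV conversion follows the stated relation QV = -10 * log10(p). The accuracy function assumes that accuracy is one minus the number of errors divided by the subread length, returned as a fraction rather than a percentage; the function and parameter names are hypothetical.

```python
import math


def qv_from_error_probability(p):
    """Convert an error probability (0 < p <= 1) to a quality value."""
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return -10 * math.log10(p)


def error_probability_from_qv(qv):
    """Invert the relation: p = 10 ** (-QV / 10)."""
    return 10 ** (-qv / 10)


def subread_accuracy(deletions, insertions, substitutions, subread_length):
    """Assumed formula: 1 - errors / subread length, as a fraction."""
    if subread_length <= 0:
        raise ValueError("subread_length must be positive")
    errors = deletions + insertions + substitutions
    return 1 - errors / subread_length


qv_example = qv_from_error_probability(0.001)     # one error expected per 1000 calls
accuracy_example = subread_accuracy(5, 8, 2, 100)  # 15 errors over 100 bases
```

Because the scale is logarithmic, every increase of 10 in QV corresponds to a tenfold drop in the error probability.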
Read Terminology
De Novo Circular Consensus (CCS) Read : The consensus sequence produced from the alignment of subreads taken from a single ZMW. This is not aligned against a reference sequence.
Raw Read : All base calls from a ZMW. Includes insert DNA and adapter sequence.
Single Molecule Variant Detection (SMVD) Read : The consensus sequence produced using all subreads taken from a single ZMW and aligned to a known reference sequence. (This was formerly known as RCCS.)

Read Length Terminology
Mapped Read Length : The distance between the first aligned base and the last aligned base in a raw read, inclusive of insert and adapter alignments.
Mapped Subread Read Length : The length of the subread alignment to a target reference sequence. This does not include the adapter sequence.
Read Length : The total number of bases produced from a ZMW after trimming. This may include the adapter sequence.
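The Filtering entry in the terminology above gives default thresholds of Read Quality ≥ 0.75 and Read length ≥ 50 bases. The Python sketch below applies those two thresholds to a list of reads. The `Read` record and its field names are hypothetical and exist only for this example; they do not correspond to any SMRT Analysis data structure.

```python
from dataclasses import dataclass


@dataclass
class Read:
    """Hypothetical record holding the two properties used for filtering."""
    name: str
    length: int
    quality: float


def filter_reads(reads, min_quality=0.75, min_length=50):
    """Keep reads that meet both the quality and the length threshold."""
    return [
        read for read in reads
        if read.quality >= min_quality and read.length >= min_length
    ]


raw_reads = [
    Read("a", 1200, 0.86),  # passes both thresholds
    Read("b", 40, 0.90),    # too short
    Read("c", 5000, 0.60),  # quality too low
    Read("d", 50, 0.75),    # exactly on both thresholds, kept
]
post_filter_reads = filter_reads(raw_reads)
post_filter_bases = sum(read.length for read in post_filter_reads)
```

The post-filter read count and base total computed this way correspond in spirit to the "Post-Filter" rows of a filtering report, although the real pipeline works on polymerase reads and its own quality predictor.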
SMRT® Portal Help

What is SMRT Portal and how do I use it?
Use SMRT Portal to perform secondary analysis of sequencing data generated by one or more PacBio System runs. You create and submit jobs. Jobs specify the SMRT Cells whose data will be analyzed, as well as which analysis protocols to use. After the job has completed, you then view the secondary analysis data generated.

Working with SMRT Portal
* Create and submit a job.
* View the secondary analysis data generated.
* Create a hybrid assembly using high-confidence contigs.
* Open, monitor, or delete jobs.
* Export metrics and table data.
* Change your password and restore table settings.

Reports generated by SMRT Portal
* SMRT Portal reports

Administrating and Managing SMRT Portal
For the following functions, you must be logged in as a scientist or administrator:
* Managing secondary analysis protocols
* Managing reference sequences
* Importing raw data from SMRT Cells for analysis
* Importing SMRT Pipe jobs
For the following functions, you must be logged in as an administrator:
* Managing application users
* Managing groups
* Specifying site-wide application settings
* Archiving and restoring jobs

Reference
* SMRT Portal hardware/software requirements
* Protocols provided by Pacific Biosciences
* Pacific Biosciences software overview
* Pacific Biosciences terminology

For troubleshooting information, see http://github.com/PacificBiosciences/SMRT-Analysis/wiki/Troubleshooting-the-SMRT-Analysis-Suite
For additional technical support, contact Pacific Biosciences at TechSupport@pacificbiosciences.com or 877-920-7222.
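As one of the first companies in the world to use PacBio's latest P6-C4 reagents, 千年基因 reports that, through continued optimization of experimental conditions and strict control of the experimental workflow, it has been among the first to upgrade its PacBio RS II third-generation sequencing, with significant gains in both read length and throughput.

After the upgrade, the company's PacBio RS II runs reach an average read length above 11 kb, a read N50 above 16 kb, and a throughput of up to 1 Gb per SMRT Cell, well above PacBio's official reference figures. Longer reads and higher throughput benefit projects such as de novo genome sequencing, metagenome sequencing, full-length transcript sequencing, and full-length 16S rDNA sequencing.

Since launching its PacBio RS II service, 千年基因 has worked with a large number of research institutions in China on de novo genome sequencing projects for animals, plants, and microorganisms. The company also plans to carry out a human genome de novo sequencing project on the third-generation platform for the first time, using the platform's long reads to assemble a highest-quality Asian reference genome and so support deeper study of disease-causing variants in Asian populations.

Source: the 千年基因 official website.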
SOAPindel: Efficient identification of indels from short paired reads
Shengting Li, Ruiqiang Li, Heng Li, Jianliang Lu, Yingrui Li, Lars Bolund, Mikkel Schierup, and Jun Wang
Genome Res. published 12 September 2012; doi:10.1101/gr.132480.111
http://genome.cshlp.org/content/early/2012/09/12/gr.132480.111.abstract.html

We present a new approach to indel calling which explicitly exploits that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false positive rate around 10% for long indels (>5 bp) while still providing many more candidate indels than other approaches.
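The first step described in the abstract, placing each unmapped read at its expected position from its mapped partner and then looking for regions where such reads pile up, can be sketched in Python as follows. This is a simplified illustration and not SOAPindel's code: it assumes a fixed read length and a fixed insert size (outer fragment length), uses hypothetical function names, a hypothetical window width and read threshold, and leaves out the de Bruijn graph assembly entirely.

```python
from collections import Counter


def expected_start(mate_start, mate_is_forward, read_length, insert_size):
    """Estimate the leftmost position of an unmapped read from its mapped mate.

    Assumes the two reads face each other at opposite ends of one fragment.
    """
    if mate_is_forward:
        # Mate opens the fragment; the partner closes it on the right.
        return mate_start + insert_size - read_length
    # Mate closes the fragment; the partner opens it on the left.
    return mate_start + read_length - insert_size


def candidate_windows(mates, read_length, insert_size, window=100, min_reads=3):
    """Return windows holding at least min_reads placed unmapped reads."""
    counts = Counter()
    for chrom, mate_start, mate_is_forward in mates:
        start = max(0, expected_start(mate_start, mate_is_forward,
                                      read_length, insert_size))
        counts[(chrom, start // window)] += 1
    return sorted(
        (chrom, w * window, (w + 1) * window)
        for (chrom, w), n in counts.items()
        if n >= min_reads
    )


# Mapped mates of unmapped reads: (chromosome, leftmost position, on forward strand)
mapped_mates = [
    ("chr1", 1000, True),
    ("chr1", 1010, True),
    ("chr1", 1020, True),
    ("chr1", 1800, False),
    ("chr1", 5000, True),
]
regions = candidate_windows(mapped_mates, read_length=100, insert_size=500)
```

In this toy input, four of the five unmapped reads are placed between positions 1400 and 1500, so that window is reported as a candidate region for local assembly, while the isolated read near 5400 is ignored.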