ScienceNet.cn (科学网)


Tag: reads


Related blog posts

Why do intronic reads make up such a large share of RNA-seq data?
chinapubmed 2019-8-23 07:17
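When RSeQC is used to compute the read distribution of an RNA-seq library, intronic reads often turn out to be a large share, roughly 30-40%. The reasons are as follows:

1. Transcription is a dynamic process, so a sample always contains incompletely spliced RNA (nascent, ongoing, co-transcriptional, unspliced ...). Because introns are typically about 20 times as long as exons, even 1% of incompletely spliced RNA shifts the distribution noticeably. This is the main cause and cannot be avoided; its size depends on the species and on the sample source (cell, tissue, ...).
Reference: Total RNA sequencing reveals nascent transcription and widespread co-transcriptional splicing in the human brain
For example:
pre-mRNA fraction    intronic read fraction
1%                   16%
2%                   28%
5%                   49%

2. Genomic DNA contamination. This also contributes, depending mainly on the technique, reagents and handling used in the experiment.

3. Incomplete genome annotation. For well-studied species this usually has little effect, but it is still a factor.

4. Other factors: the analysis software used, including the aligner, and how "intronic reads" are defined.

References:
http://seqanswers.com/forums/showthread.php?t=5519
http://seqanswers.com/forums/showthread.php?t=15296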
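The example figures in point 1 can be approximated with a very simple model. The sketch below is an assumption, not the method of the cited paper: it supposes that read counts are proportional to transcript length, that every transcript carries one unit of exon, and that unspliced transcripts additionally carry introns 20 times as long. The function name and parameters are hypothetical.

```python
def intronic_read_fraction(pre_mrna_fraction, intron_to_exon_ratio=20.0):
    """Expected share of intronic reads under a length-proportional model.

    pre_mrna_fraction: share of transcripts that are still unspliced (0..1).
    intron_to_exon_ratio: intron length relative to exon length (assumed 20).
    """
    if not 0.0 <= pre_mrna_fraction <= 1.0:
        raise ValueError("pre_mrna_fraction must be between 0 and 1")
    exonic = 1.0  # every transcript, spliced or not, contributes its exons
    intronic = pre_mrna_fraction * intron_to_exon_ratio  # only unspliced ones add introns
    return intronic / (intronic + exonic)


if __name__ == "__main__":
    for fraction in (0.01, 0.02, 0.05):
        share = intronic_read_fraction(fraction)
        print(f"{fraction:.0%} pre-mRNA -> {share:.1%} intronic reads")
```

Under these assumptions the model gives values close to the 16%, 28% and 49% quoted above, which suggests the table was derived from similar reasoning.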
Category: Bioinformatics | 3815 views | 0 comments
HiC-Pro analysis workflow explained
luria 2019-6-17 14:20
This post follows the previous one, "Installing and using HiC-Pro", and walks through the analysis workflow of HiC-Pro, the Hi-C read alignment software. For installation and usage see: http://blog.sciencenet.cn/blog-2970729-1182259.html

3. HiC-Pro analysis workflow

The processing falls into two main parts, matching HiC-Pro's two steps: step 1 is alignment, step 2 is the Hi-C fragment analysis.

3.1 HiC-Pro first aligns the two ends of the PE reads separately with bowtie2

Aligning the R1 reads:

/path/to/bowtie2 --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder \
  --un bowtie_results/bwt2_global/sample/sample_R1_sample_genome_ref.bwt2glob.unmap.fastq \
  --rg-id BMG --rg SM:sample_R1 -p 24 -x /path/to/ref/index \
  -U rawdata/sample/sample_R1.fastq.gz 2> logs/sample/sample_R1_bowtie2.log \
  | /path/to/samtools view -F 4 -bS - > bowtie_results/bwt2_global/sample/sample_R1_sample_genome_ref.bwt2glob.bam

## bowtie2 parameters
# Note: --very-sensitive is used here. Hi-C aligners built on bowtie2 usually choose its most sensitive preset.
# -L            length of the seed substrings; must be greater than 3 and less than 32
# --end-to-end  entire read must align; no clipping
# --reorder     force SAM output order to match order of input reads
# --un          write the reads that failed to align to this fastq file
# --rg-id, --rg read group ID and related fields
# -p            number of threads
# -x            index of the reference genome
# -U            input file of unpaired reads
## samtools view parameters
# -F 4 drops unmapped reads. They are already recorded in the unmap.fastq file, so the output bam keeps no records for them, which shortens later file reads.

The R2 reads are aligned with the same parameters:

/path/to/bowtie2 --very-sensitive -L 30 --score-min L,-0.6,-0.2 --end-to-end --reorder \
  --un bowtie_results/bwt2_global/sample/sample_R2_sample_genome_ref.bwt2glob.unmap.fastq \
  --rg-id BMG --rg SM:sample_R2 -p 24 -x /path/to/ref/index \
  -U rawdata/sample/sample_R2.fastq.gz 2> logs/sample/sample_R2_bowtie2.log \
  | /path/to/samtools view -F 4 -bS - > bowtie_results/bwt2_global/sample/sample_R2_sample_genome_ref.bwt2glob.bam

When this finishes, a bwt2_global folder appears under bowtie_results, with a subdirectory named after the sample. It holds the R1 alignment bam and unmapped fastq, and the R2 alignment bam and unmapped fastq.

3.2 Trimming and re-aligning the unmapped reads

# Trim the unmapped R1 reads
/path/to/HiC-Pro_2.11.1/scripts/cutsite_trimming \
  --fastq bowtie_results/bwt2_global/sample/sample_R1_sample_genome_ref.bwt2glob.unmap.fastq \
  --cutsite GATCGATC \
  --out bowtie_results/bwt2_local/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_trimmed.fastq \
  > logs/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_readsTrimming.log 2>&1
# --fastq    the unmapped reads to process
# --cutsite  the cut-site sequence at which reads are trimmed
# --out      the trimmed reads

# Trim the unmapped R2 reads, with the same parameters as R1
/path/to/HiC-Pro_2.11.1/scripts/cutsite_trimming \
  --fastq bowtie_results/bwt2_global/sample/sample_R2_sample_genome_ref.bwt2glob.unmap.fastq \
  --cutsite GATCGATC \
  --out bowtie_results/bwt2_local/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_trimmed.fastq \
  > logs/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_readsTrimming.log 2>&1

# Re-align the trimmed R1 reads
Note: unlike the first alignment (section 3.1), samtools view is run here without -F 4, so unmapped reads are also written to the bam file. The mapping rate is computed later and has to account for them.

/path/to/bowtie2 --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder \
  --rg-id BML --rg SM:sample_R1_sample_genome_ref.bwt2glob.unmap -p 24 -x /path/to/ref/index \
  -U bowtie_results/bwt2_local/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_trimmed.fastq \
  2> logs/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_bowtie2.log \
  | /path/to/samtools view -bS - > bowtie_results/bwt2_local/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam

# Re-align the trimmed R2 reads
/path/to/bowtie2 --very-sensitive -L 20 --score-min L,-0.6,-0.2 --end-to-end --reorder \
  --rg-id BML --rg SM:sample_R2_sample_genome_ref.bwt2glob.unmap -p 24 -x /path/to/ref/index \
  -U bowtie_results/bwt2_local/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_trimmed.fastq \
  2> logs/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_bowtie2.log \
  | /path/to/samtools view -bS - > bowtie_results/bwt2_local/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam

When this finishes, a bwt2_local folder appears under bowtie_results, with a subdirectory named after the sample. It holds the bam of the trimmed R1 reads and the file of reads that still could not be aligned after trimming, plus the same two files for R2.

3.3 Merging the two alignment rounds, separately for R1 and R2

# Merge the two alignment rounds for R1
/path/to/samtools merge -@ 24 -n -f bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam \
  bowtie_results/bwt2_global/sample/sample_R1_sample_genome_ref.bwt2glob.bam \
  bowtie_results/bwt2_local/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam
# -f  overwrite the output file if it exists
# -n  input files are sorted by read name

# Merge the two alignment rounds for R2
/path/to/samtools merge -@ 24 -n -f bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam \
  bowtie_results/bwt2_global/sample/sample_R2_sample_genome_ref.bwt2glob.bam \
  bowtie_results/bwt2_local/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam

# Sort the merged R1 result
/path/to/samtools sort -@ 24 -n -T tmp/sample_R1_sample_genome_ref \
  -o bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.sorted.bam \
  bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam

# Sort the merged R2 result
/path/to/samtools sort -@ 24 -n -T tmp/sample_R2_sample_genome_ref \
  -o bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.sorted.bam \
  bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam

# Rename the R1 .sorted.bam back to .bam. This step was presumably added after a bug was found, so that downstream code did not have to change.
mv bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.sorted.bam bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam
# Rename the R2 .sorted.bam back to .bam
mv bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.sorted.bam bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam

When this finishes, a bwt2 folder appears under bowtie_results, with a subfolder named after the sample. It contains two files ending in bwt2merged.bam, one for R1 and one for R2.

3.4 Computing the mapping rate

This is done with samtools view -c:

# Count the total reads and the mapped reads in the final R1 and R2 results
/path/to/samtools view -c bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam
/path/to/samtools view -c bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam
/path/to/samtools view -c -F 4 bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam
/path/to/samtools view -c -F 4 bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam
# Count the mapped reads of each of the two alignment rounds, for R1 and R2
/path/to/samtools view -c -F 4 bowtie_results/bwt2_global/sample/sample_R1_sample_genome_ref.bwt2glob.bam
/path/to/samtools view -c -F 4 bowtie_results/bwt2_global/sample/sample_R2_sample_genome_ref.bwt2glob.bam
/path/to/samtools view -c -F 4 bowtie_results/bwt2_local/sample/sample_R1_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam
/path/to/samtools view -c -F 4 bowtie_results/bwt2_local/sample/sample_R2_sample_genome_ref.bwt2glob.unmap_bwt2loc.bam

3.5 Pairing the PE reads with HiC-Pro's mergeSAM.py

/path/to/python /path/to/HiC-Pro_2.11.1/scripts/mergeSAM.py -q 10 -t -v \
  -f bowtie_results/bwt2/sample/sample_R1_sample_genome_ref.bwt2merged.bam \
  -r bowtie_results/bwt2/sample/sample_R2_sample_genome_ref.bwt2merged.bam \
  -o bowtie_results/bwt2/sample/sample_sample_genome_ref.bwt2pairs.bam
# -q  minimum mapping quality
# -t  write alignment statistics
# -v  verbose output, for debugging
# -f/--forward, -r/--reverse  the final bam files of the separately aligned ends (the results of 3.3)
# -o  the processed output

mergeSAM.py has further options: "report singleton" outputs alignments where only one end mapped, and "report multiple hits" outputs multiply mapped alignments.

When this finishes, the bwt2 subdirectory contains the filtered result (suffix bwt2pairs.bam) and its statistics file (suffix .bwt2pairs.pairstat), which reports the full alignment summary: Total_pairs_processed, Unmapped_pairs, Low_qual_pairs, Unique_paired_alignments, Multiple_pairs_alignments, Pairs_with_singleton, Low_qual_singleton, Unique_singleton_alignments, Multiple_singleton_alignments, Reported_pairs.

This completes the alignment part; all of its results are under bowtie_results.

=====================================================

Next comes the Hi-C fragment analysis; all of its results are under hic_results.

3.6 Converting alignments to Hi-C fragment information with HiC-Pro's mapped_2hic_fragments.py

/path/to/python /path/to/HiC-Pro_2.11.1/scripts/mapped_2hic_fragments.py -v -a \
  -f /path/to/restriction_enzyme_cutting_site.MboI.txt \
  -r bowtie_results/bwt2/sample/sample_sample_genome_ref.bwt2pairs.bam \
  -o hic_results/data/sample
# -v  verbose output, for debugging
# -a  record everything, including self-circles, dangling ends and so on
# -f  the file of restriction sites found in the genome sequence at the start, i.e. the output of HiC-Pro_2.11.1/bin/utils/digest_genome.py
# -r  the final bowtie2 alignment, i.e. the result of section 3.5
# -o  the output directory
Insert-size thresholds and other settings can also be given here; see --help.

Then sort the resulting valid pairs file:

LANG=en; sort -T tmp -k2,2V -k3,3n -k5,5V -k6,6n \
  -o hic_results/data/sample/sample_sample_genome_ref.bwt2pairs.validPairs \
  hic_results/data/sample/sample_sample_genome_ref.bwt2pairs.validPairs
# The valid pairs file is sorted in place, at its original path

3.7 Merging all valid pairs and removing PCR duplicates

LANG=en; sort -T tmp -S 50% -k2,2V -k3,3n -k5,5V -k6,6n -m \
  hic_results/data/sample/sample_sample_genome_ref.bwt2pairs.validPairs \
  | awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}' \
  > hic_results/data/sample/sample.allValidPairs

This step first merges multiple valid pairs files, for example from several sequencing runs that were not combined earlier. It then compares each line with the previous one; a line with the same chromosomes and positions is a PCR duplicate and is dropped.

The results of sections 3.6 and 3.7 are written to hic_results/data.

3.8 Merging the bowtie2 statistics files with HiC-Pro's merge_statfiles.py

/path/to/python /path/to/HiC-Pro_2.11.1/scripts/merge_statfiles.py -d bowtie_results/bwt2/sample/ -p "*_R1*.mapstat" -v > hic_results/stats/sample/sample_R1.mmapstat
/path/to/python /path/to/HiC-Pro_2.11.1/scripts/merge_statfiles.py -d bowtie_results/bwt2/sample/ -p "*_R2*.mapstat" -v > hic_results/stats/sample/sample_R2.mmapstat
/path/to/python /path/to/HiC-Pro_2.11.1/scripts/merge_statfiles.py -d bowtie_results/bwt2/sample/ -p "*.pairstat" -v > hic_results/stats/sample/sample.mpairstat
/path/to/python /path/to/HiC-Pro_2.11.1/scripts/merge_statfiles.py -d hic_results/data/sample/ -p "*.RSstat" -v > hic_results/stats/sample/sample.mRSstat
# The statistics from this step are all under hic_results/stats

3.9 Building the matrix for each BIN_SIZE

This step assigns the valid pairs to bins of the BIN_SIZE given in HiC-Pro's config-hicpro.txt and builds the contact matrix.

cat hic_results/data/sample/sample.allValidPairs \
  | /path/to/HiC-Pro_2.11.1/scripts/build_matrix --matrix-format upper --binsize 20000 \
    --chrsizes /path/to/ref/reference.size --ifile /dev/stdin \
    --oprefix hic_results/matrix/sample/raw/20000/sample_20000
# --chrsizes  the length of every sequence in the genome, computed at the start
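To make the binning in step 3.9 concrete, here is a minimal sketch of how valid pairs can be counted into an upper-triangular sparse matrix. It is not HiC-Pro's build_matrix code. It assumes tab-separated valid-pair lines with the chromosome and position of the two ends in columns 2, 3, 5 and 6 (the columns the sort commands above use), 1-based positions, and genome-wide bin numbering in the order of the chromosome size list. All function names are hypothetical.

```python
from collections import Counter


def build_bin_offsets(chrom_sizes, bin_size):
    """Return {chromosome: genome-wide index of its first bin}."""
    offsets = {}
    next_bin = 0
    for chrom, size in chrom_sizes:
        offsets[chrom] = next_bin
        next_bin += (size + bin_size - 1) // bin_size  # ceiling division
    return offsets


def bin_valid_pairs(lines, chrom_sizes, bin_size):
    """Count contacts per (bin_i, bin_j) with bin_i <= bin_j."""
    offsets = build_bin_offsets(chrom_sizes, bin_size)
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        chrom1, pos1 = fields[1], int(fields[2])
        chrom2, pos2 = fields[4], int(fields[5])
        i = offsets[chrom1] + (pos1 - 1) // bin_size
        j = offsets[chrom2] + (pos2 - 1) // bin_size
        if i > j:  # keep only the upper triangle
            i, j = j, i
        counts[(i, j)] += 1
    return counts


if __name__ == "__main__":
    sizes = [("chr1", 50000), ("chr2", 30000)]
    pairs = [
        "r1\tchr1\t100\t+\tchr1\t45000\t-",
        "r2\tchr2\t25000\t+\tchr1\t100\t-",
    ]
    for (i, j), n in sorted(bin_valid_pairs(pairs, sizes, 20000).items()):
        print(i, j, n)
```

Storing only non-zero upper-triangle cells keeps the output small, because a Hi-C contact matrix is symmetric and mostly empty at fine resolutions.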
When this finishes, the matrix for each resolution is written to hic_results/matrix/sample/raw.

3.10 Plotting the statistics

Five plots are produced: plotHiCContactRanges_sample.pdf, plotHiCFragmentSize_sample.pdf, plotMappingPairing_sample.pdf, plotHiCFragment_sample.pdf and plotMapping_sample.pdf.

3.11 Normalizing the raw matrix with ice

/path/to/python /path/to/HiC-Pro_2.11.1/scripts/ice \
  --results_filename hic_results/matrix/sample/iced/20000/sample_20000_iced.matrix \
  --filter_low_counts_perc 0.02 --filter_high_counts_perc 0 --max_iter 100 --eps 0.1 \
  --remove-all-zeros-loci --output-bias 1 --verbose 1 \
  hic_results/matrix/sample/raw/20000/sample_20000.matrix
# The raw matrix of each resolution is normalized; the results are written to hic_results/matrix/sample/iced
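The duplicate filter used in step 3.7 can be restated in Python as follows. This is a sketch that mirrors the awk rule, assuming the input is already sorted on columns 2, 3, 5 and 6 so that duplicates are adjacent. The function name is hypothetical, and like the awk expression it ignores strand.

```python
def drop_duplicate_pairs(sorted_lines):
    """Yield each valid-pair line whose (chr1, pos1, chr2, pos2) differs
    from the previous line; adjacent repeats are treated as PCR duplicates."""
    previous_key = None
    for line in sorted_lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[1], fields[2], fields[4], fields[5])
        if key != previous_key:
            yield line
            previous_key = key


if __name__ == "__main__":
    pairs = [
        "r1\tchr1\t100\t+\tchr1\t900\t-",
        "r2\tchr1\t100\t+\tchr1\t900\t-",  # same coordinates as r1: dropped
        "r3\tchr1\t100\t+\tchr2\t50\t+",
    ]
    for kept in drop_duplicate_pairs(pairs):
        print(kept)
```

Because only neighbouring lines are compared, the filter runs in one pass with constant memory, which is why the sort has to come first.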
Category: Hi-C | 13591 views | 0 comments
[Repost] featureCounts: an efficient program for assigning sequence reads to genomic features (counting the reads in each gene)
chinapubmed 2019-5-28 07:30
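Next-generation sequencing produces millions of short reads, which are usually aligned to a reference genome. In many applications the key information for downstream analysis is the number of reads aligned to each genomic feature, for example to each exon or each gene. The process of counting reads is called read summarization. Read summarization is needed for all kinds of genomic analyses, yet it has so far received relatively little attention in the literature.

featureCounts is a read summarization program suited to counting reads from either RNA or DNA sequencing experiments. It implements efficient chromosome hashing and feature blocking techniques. It is faster than existing methods (by an order of magnitude for gene-level summarization) and needs far less memory. It works with single-end or paired-end reads and offers a range of options for different sequencing applications.

Download: http://www.sourceforge.net/projects/subread http://www.bioconductor.org/pa ... .html
Language: R
Date: 20131123
Reference: featureCounts: an efficient general purpose program for assigning sequence reads to genomic features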
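The idea of read summarization can be illustrated with a small sketch. It is not featureCounts' implementation: it only borrows the general idea of looking features up by chromosome and by fixed-width block so that each read is compared with nearby features only. The block width, the function names, the inclusive coordinates and the choice to skip reads that overlap more than one gene are all assumptions of this example.

```python
from collections import defaultdict

BLOCK = 100_000  # hypothetical block width in bases


def index_features(features):
    """Index (gene, chrom, start, end) features by (chrom, block number)."""
    index = defaultdict(list)
    for gene, chrom, start, end in features:
        for block in range(start // BLOCK, end // BLOCK + 1):
            index[(chrom, block)].append((start, end, gene))
    return index


def count_reads(reads, index):
    """Count reads per gene; reads hitting zero or several genes are skipped."""
    counts = defaultdict(int)
    for chrom, start, end in reads:
        hits = set()
        for block in range(start // BLOCK, end // BLOCK + 1):
            for f_start, f_end, gene in index.get((chrom, block), ()):
                if start <= f_end and f_start <= end:  # inclusive overlap
                    hits.add(gene)
        if len(hits) == 1:
            counts[hits.pop()] += 1
    return dict(counts)


if __name__ == "__main__":
    features = [("geneA", "chr1", 100, 500), ("geneB", "chr1", 400, 900)]
    reads = [("chr1", 120, 170), ("chr1", 450, 480), ("chr1", 600, 650)]
    print(count_reads(reads, index_features(features)))
```

A real summarization tool additionally handles strandedness, paired-end fragments and exon-to-gene grouping, none of which is modelled here.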
Category: Software | 4425 views | 0 comments
Glossary of PacBio RS II reads analysis terms (3)
alinatingting 2015-3-25 11:32
In response to questions from researchers about third-generation sequencing data analysis, this post explains the sequencing statistics, quality (QC) assessment and data summary of the basic analysis, with reference to project cases.

General - Filtering Report
* Polymerase Read Bases : The number of bases in the polymerase read. (That is, all the sequence data obtained, adaptor sequences included.)
* Polymerase Reads : The number of polymerases generating high quality reads. Polymerase reads are trimmed to the high quality region and include bases from adaptors, as well as potentially multiple passes around a circular template. (That is, the high-quality reads, which contain adaptors and the multiple subreads obtained from repeated passes.)
* Polymerase Read N50 : 50% of all polymerase reads are longer than this value.
* Polymerase Read Length : The mean trimmed read length of all polymerase reads. The value includes bases from adaptors as well as multiple passes around a circular template.
* Polymerase Read Quality : The mean single-pass read quality of all polymerase reads.
* Post-Filter Polymerase Read Bases : The number of bases in the polymerase reads after filtering, including adaptors. (Multiple subreads are included as well.)
* Post-Filter Polymerase Reads : The number of polymerases generating trimmed reads after filtering. Polymerase reads include bases from adaptors and multiple passes around a circular template.
* Post-Filter Polymerase Read Length : The mean trimmed read length of all polymerase reads after filtering. The value includes bases from adaptors as well as multiple passes around a circular template.
* Post-Filter Polymerase Read Quality : The mean single-pass read quality of all polymerase reads after filtering.

Terms from the other output reports follow.

Diagnostic - Adapters Report
Adapter Dimers (%) : The % of pre-filter ZMWs which have observed inserts of 0-10bp. These are likely adapter dimers.
(Adapter dimers: among the reads before filtering, sequences of 0-10 bp are very likely adapter dimers.)
Short Inserts (%) : The % of pre-filter ZMWs which have observed inserts of 11-100bp. These are likely short fragment contamination. (Short inserts: among the reads before filtering, sequences of 11-100 bp are very likely short contaminating fragments.)

Diagnostic - Spike-In Control Report
Control Sequence : The name of the control sequence.
Control Reads (%) : The percent of post-filter polymerase reads that are from the control sample. The formula for this is: (total # of control reads)/(total # of post-filter reads).
Control Polymerase Read Length : The mean mapped read length of the polymerase reads from the control sample.
Control Reads : The total number of polymerase reads from the control sample that passed filtering.
Control Subread Accuracy : The mean single-pass accuracy of the mapped polymerase reads from the control sample.
Control Polymerase Read Length 95% : The 95th percentile of mapped read length of the polymerase reads from the control sample. (That is, the length below which 95% of the mapped control reads fall; it is not a mapping rate.)

Diagnostic - Loading Report
SMRT Cell ID : ID number of the SMRT Cell(s) used in this run.
Productive ZMWs : The number of ZMWs for this SMRT Cell that produced results with Productivity = 1.
Productivity 0 (%) : Percentage of ZMWs that are empty, with no polymerase.
Productivity 1 (%) : Percentage of ZMWs that are productive and sequencing. (The zero-mode waveguide is loaded with a polymerase and can be sequenced.)
Productivity 2 (%) : Percentage of ZMWs that are not P0 (empty) or P1 (productive). This may occur for a variety of reasons and the sequence data is not usable.

Resequencing - Coverage Report
Coverage : The mean depth of coverage across the reference sequence.
(Coverage: the mean coverage, i.e. mean sequencing depth, of the total sequence data relative to the reference genome.)
Missing Bases (%) : The percentage of the reference sequence that has zero coverage. (Regions of the reference that are not covered at all, i.e. sequencing depth 0.)

Resequencing - Mapping Report
Post-Filter Reads : The number of reads that passed filtering.
Mapped Reads : The number of post-filter reads that mapped to the reference sequence.
Mapped Subreads : The number of post-filter subreads that mapped to the reference sequence.
Mapped CCS Reads : The number of post-filter CCS reads that mapped to the reference sequence. (CCS is the consensus sequence obtained by aligning the subreads from one ZMW; this is the number of post-filter CCS sequences that map to the reference.)
Mapped Subread Bases : The number of post-filter bases from all subreads that mapped to the reference sequence. This does not include adapters.
Mapped CCS Read Bases : The number of post-filter CCS read bases that mapped to the reference sequence. This does not include adapters.
Mapped Subread Accuracy : The mean accuracy of post-filter subreads that mapped to the reference sequence.
Mapped CCS Read Accuracy : The mean accuracy of post-filter CCS reads that mapped to the reference sequence.
Mapped Subread Length : The mean read length of post-filter subreads that mapped to the reference sequence. This does not include adapters.
Mapped Read Length of Insert : The mean read length of all insert sequences, which includes only mapped sequences. The read length of insert is approximately the longest subread length per ZMW.
Mapped Polymerase Read Length : The mean read length of post-filter polymerase reads that mapped to the reference sequence. This includes adapters.
(Mapped Polymerase Read Length: the length of the post-filter reads that map to the reference; a polymerase read includes adapters.)
Mapped Polymerase Read Length 95% : The 95th percentile of read length of post-filter polymerase reads that mapped to the reference sequence. (This is a length percentile, not a mapping rate.)
Mapped Polymerase Read Length Max : The maximum read length of post-filter polymerase reads that mapped to the reference sequence.
Mapped Full Subread Length : The average of the lengths of full subreads that mapped to the reference sequence. Full subreads are subreads flanked by two adapters.

Analysis - Variants Report
Reference : The name of the reference sequence.
Reference Length : The length of the reference sequence.
Bases Called (%) : The percentage of reference sequence that has ≥ 1x coverage. % Bases Called + % Missing Bases should equal 100.
Consensus Accuracy : The accuracy of the consensus sequence compared to the reference.
Base Coverage : The mean depth of coverage across the reference sequence.

Analysis - Top Variants Report
Sequence : The name of the reference sequence.
Position : The position of the variant along the reference sequence.
Variant : The variant position, type, and affected nucleotide.
Type : The variant type: Insertion, Deletion, or Substitution.
Coverage : The coverage at position.
Confidence : The confidence of the variant call.
Genotype : Includes the full number of chromosomes (diploid) or half the number (haploid).

Assembly - Iterations Report
Assembly Iterations : The number of iterations of overlap-layout-consensus performed by the de novo or hybrid assembly algorithm.

Assembly - Draft Assembly Report
Draft Contigs : The number of contigs output by Celera Assembler, which may include singleton and degenerate contigs. After assembly polishing with Quiver, the final number of contigs may be smaller.
N50 Contig Length : The length L of the contig for which 50% of all bases in the final contigs are of length greater than L. Reads Assembled (%) : The fraction of all reads that are assembled into contigs in the final assembly. Max Contig Length : The length of the longest contig in the final assembly. Sum of Contig Lengths : The sum of the lengths of all contigs in the final assembly. Hybrid Assembly - Assembly Iterations Report Input Contigs : The number of contigs used as input to the AHA algorithm. Min Align Score : The minimum alignment score between a read and a contig to use the alignment for scaffolding. Min Link Redundancy : The minimum number of reads that must link two contigs for those contigs to be connected in a scaffold. Min Subread Length : The minimum length required for a subread to be used by the AHA algorithm. Min Contig Length : The minimum length required for a contig to be used by the AHA algorithm. Scaffolds Across Assembly Iterations : The number of scaffolds at a particular iteration of the AHA algorithm. Linking Reads Across Assembly Iterations : The number of linking reads at a particular iteration of the AHA algorithm. Hybrid Assembly - Final Assembly Report Number : The number of scaffolds, contigs, or gaps in the initial or final assembly. Max Length : The length of the longest scaffold, contig, or gap in the initial or final assembly. N50 Length : The length L of the scaffold, contig, or gap for which 50% of all bases in the initial/final scaffold/contig/gap are of length greater than L. Sum Length : The sum of the lengths of all scaffolds, contigs, or gaps in the initial or final assembly. Initial Scaffolds : The distribution of the lengths of the scaffolds sequences before completing the AHA algorithm. Scaffolds are composed of contigs optionally separated by gap sequences. Final Scaffolds : The distribution of the lengths of the scaffolds sequences after completing the AHA algorithm. 
Scaffolds are composed of contigs optionally separated by gap sequences. Initial Contigs : The distribution of the lengths of the contig sequences before completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps. Final Contigs : The distribution of the lengths of the contig sequences after completing the AHA algorithm. Contigs are stretches of continuous sequence that do not contain gaps. Initial Gaps : The distribution of the lengths of the gaps between contig sequences before completing the AHA algorithm. Final Gaps : The distribution of the lengths of the gaps between contig sequences after completing the AHA algorithm. Base Modifications - Motifs Report Motif : The nucleotide sequence of the methyltransferase recognition motif, using the standard IUPAC nucleotide alphabet. Modified Position : The position within the motif that is modified. The first base is 1. Example: The modified adenine in GATC is at position 2. Modification Type : The type of chemical modification most commonly identified at that motif. These are: 6mA, 4mC, 5mC, or modified_base (modification not recognized by the software.) % Motifs Detected : The percentage of times that this motif was detected as modified across the entire genome. # Of Motifs Detected : The number of times that this motif was detected as modified across the entire genome. # Of Motifs In Genome : The number of times this motif occurs in the genome. Mean Modification QV : The mean modification QV for all instances where this motif was detected as modified. Mean Motif Coverage : The mean coverage for all instances where this motif was detected as modified. Partner Motif : For motifs that are not self-palindromic, this is the complementary sequence. Assembly - Pre-Assembly Report Seed Bases : The number of bases from seed reads. Pre-Assembled Yield : The percentage of seed read bases that were successfully aligned to generate pre-assembled reads. 
Pre-Assembled Read Length : The average length of the pre-assembled reads.
Length Cutoff : Reads with lengths greater than the length cutoff are used as seed reads for pre-assembly.
Pre-Assembled Bases : The number of bases in the pre-assembled reads.
Pre-Assembled Reads : The number of reads output by the pre-assembler. Pre-assembled reads are very long, highly accurate reads that can be used as input to a de novo assembler.
Pre-Assembled N50 : The N50 read length of the pre-assembled reads.

To be continued.
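Several of the report fields above are N50 values. As a rough illustration, the sketch below computes N50 in the base-weighted sense used for contigs in this glossary: the length L such that sequences of length L or longer hold at least half of all bases. The function name is hypothetical, and the exact tie-breaking used by the PacBio software is not known from this text.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    together contain at least half of all bases."""
    if not lengths:
        raise ValueError("n50() needs at least one length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


if __name__ == "__main__":
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Weighting by bases rather than by read count is what makes N50 less sensitive to large numbers of very short sequences than a plain median.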
Category: PacBio RS II platform data analysis | 16628 views | 0 comments
[Repost] Glossary of PacBio RS II reads analysis terms (2)
alinatingting 2015-3-25 10:01
SMRT® Portal Help
Pacific Biosciences Terminology

General Terminology
Adapters : Hairpin loops that are ligated to both ends of the double stranded DNA insert. When adapter sequences are removed, the read is split into multiple subreads. (These are the hairpin-shaped SMRTbell adapters, ligated to the blunt ends of the double-stranded DNA template during library construction.)
Movie : Real-time observation of a SMRT Cell. (The real-time observation period of sequencing one SMRT Cell.)
Read : A contiguous sequence generated from a ZMW that includes an insert sequence and may include an adapter sequence. A read is composed of alternating subreads and adapters.
Sequencing ZMW : A ZMW that is expected to be able to produce a sequence if it is populated with a polymerase. ZMWs used for automated SMRT Cell alignment are not considered sequencing ZMWs.
Subread : Sequence generated by splitting the raw sequence from a ZMW by the adapters. This is the post-sequencing version of the "insert DNA" used in sample preparation. (That is, the insert DNA or target sequence.)
Zero-Mode Waveguide (ZMW) : A nanophotonic device for confining light to a small observation volume that can be, for example, a small hole in a conductive layer whose diameter is too small to permit the propagation of light in the wavelength range used for detection.

Primary Analysis Terminology
Adapter Screening : Annotates adapter read locations. Used to break a read into subreads during secondary analysis mapping and Circular Consensus.
High Quality Region Screening : Annotates the high quality sequencing regions of a read to be used during Raw Read Trimming.
Insert Screening : Annotates insert DNA regions in the raw read.
Quality Value Assignment : A prediction of the error probability of a basecall.
(Quality Value Assignment: assesses the quality of each base.)
Quality Value (QV) : The total probability that the basecall is an insertion or substitution or is preceded by a deletion. QV = -10 * log10(p)
Insertion QV : The probability that the basecall is an insertion with respect to the true sequence.
Deletion QV : The probability that a deletion error occurred before the current base.
Substitution QV : The probability that the basecall is a substitution.
Raw Read Trimming : Extraction of high quality regions from a raw read. This results in a read.
Read Quality Assignment : A trained prediction of a read's mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on). This is used during secondary analysis filtering.

Secondary Analysis Terminology
Consensus : Generation of a consensus sequence from multiple-sequence alignment.
De Novo Assembly : Assembly of all subreads without a reference sequence.
Filtering : Removes reads that do not meet the Read Quality and Read length parameters set by the user. The current default filtering parameters defined by Pacific Biosciences are: Read Quality ≥ .75 (as of SMRT Analysis v1.3.1); Read length ≥ 50 bases.
Mapping : Local alignment of a read or subread to a reference sequence.

Accuracy Terminology
Circular Consensus Accuracy : Accuracy of the circular consensus read.
Consensus Accuracy : Accuracy of the consensus sequence compared to the reference.
Read Quality : A trained prediction of a read's mapped accuracy based on its pulse and base file characteristics (peak signal-to-noise ratio, average base QV, interpulse duration, and so on).
Single Molecule Raw Accuracy : Accuracy based on one pass on one single molecule.
Subread Accuracy : The post-mapping accuracy of the basecalls, computed from the number of errors, where errors = number of deletions + insertions + substitutions.
Read Terminology De Novo Circular Consensus (CCS) Read : The consensus sequence produced from the alignment of subreads taken from a single ZMW. This is not aligned against a reference sequence. Raw Read : All base calls from a ZMW. Includes insert DNA and adapter sequence. Single Molecule Variant Detection (SMVD) Read : The consensus sequence produced using all subreads taken from a single ZMW and aligned to a known reference sequence. (This was formerly known as RCCS .) Read Length Terminology Mapped Read length : The distance between the first aligned base and the last aligned base in a raw read, inclusive of insert and adapter alignments. Mapped Subread Read length : The length of the subread alignment to a target reference sequence. This does not include the adapter sequence. Read length : The total number of bases produced from a ZMW after trimming. This may include the adapter sequence.
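The relation between a read, its adapters and its subreads ("a read is composed of alternating subreads and adapters") can be sketched in a few lines. This is an illustration only, not PacBio's adapter screening: it assumes the adapter locations are already known as half-open (start, end) index pairs into the read string, and the function name is hypothetical.

```python
def split_into_subreads(read, adapter_spans):
    """Cut a read at the given adapter spans and return the pieces between them.

    adapter_spans: iterable of (start, end) indices, end exclusive.
    """
    subreads = []
    cursor = 0
    for start, end in sorted(adapter_spans):
        if start > cursor:  # sequence between the last adapter and this one
            subreads.append(read[cursor:start])
        cursor = max(cursor, end)
    if cursor < len(read):  # trailing sequence after the last adapter
        subreads.append(read[cursor:])
    return subreads


if __name__ == "__main__":
    # Toy read: insert pass, adapter (TTT), insert pass, adapter, partial pass
    read = "AAAA" + "TTT" + "CCCCC" + "TTT" + "GG"
    print(split_into_subreads(read, [(4, 7), (12, 15)]))
```

In this toy read only the middle piece is flanked by two adapters, which is what the glossary calls a full subread.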
Category: PacBio RS II platform data analysis | 4626 views | 0 comments
[Repost] About PacBio RS II reads analysis (1)
alinatingting 2015-3-25 09:50
SMRT ® Portal Help What is SMRT Portal and how do I use it? Use SMRT Portal to perform secondary analysis of sequencing data generated by one or more PacBio System runs. You create and submit jobs . Jobs specify the SMRT Cells whose data will be analyzed, as well as which analysis protocols to use. After the job has completed, you then view the secondary analysis data generated. Working with SMRT Portal Create and submit a job. View the secondary analysis data generated. Create a hybrid assembly using high-confidence contigs. Open , monitor , or delete jobs. Export metrics and table data. Change your password and restore table settings Reports generated by SMRT Portal SMRT Portal reports Administrating and Managing SMRT Portal For the following functions, you must be logged in as a scientist or administrator : Managing secondary analysis protocols Managing reference sequences Importing raw data from SMRT Cells for analysis Importing SMRT Pipe jobs For the following functions, you must be logged in as an administrator : Managing application users Managing groups Specifying site-wide application settings Archiving and restoring jobs Reference SMRT Portal hardware/software requirements Protocols provided by Pacific Biosciences Pacific Biosciences software overview Pacific Biosciences terminology For troubleshooting information, see http://github.com/PacificBiosciences/SMRT-Analysis/wiki/Troubleshooting-the-SMRT-Analysis-Suite For additional technical support, contact Pacific Biosciences at TechSupport@pacificbiosciences.com or 877-920-7222.
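Category: PacBio RS II platform data analysis | 2930 views | 0 comments
千年基因 is among the first to upgrade PacBio RS II third-generation sequencing, with markedly longer reads and higher throughput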
alinatingting 2014-12-5 12:24
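As one of the first companies worldwide to use PacBio's latest P6-C4 chemistry, 千年基因 has upgraded its PacBio RS II third-generation sequencing ahead of others by continually optimizing experimental conditions and strictly controlling the workflow, with marked gains in read length and throughput.

After the upgrade, the mean read length exceeds 11 kb, the read N50 exceeds 16 kb, and the throughput of each SMRT Cell reaches 1 Gb, well above PacBio's official reference figures. Longer reads and higher throughput benefit projects such as de novo genome sequencing, metagenome sequencing, full-length transcript sequencing and full-length 16S rDNA sequencing.

Since the service was launched, 千年基因 has worked with many research institutions in China on de novo genome sequencing projects for animals, plants and microorganisms. The company will also apply the third-generation platform to de novo sequencing of a human genome for the first time, using the platform's long reads to assemble the highest-quality Asian reference genome, which will support deeper study of disease-causing variants in Asian populations.

Source: the 千年基因 official website.
Category: Company news | 2913 views | 0 comments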
The FASTQ format explained & (Phred33 or Phred64)
jiewencai 2014-7-20 22:05
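FASTQ is a text-based file format that stores biological sequences together with the quality of each base (or amino acid). It was originally developed at the Wellcome Trust Sanger Institute and has become the de facto standard for storing high-throughput sequencing data. Taking the Illumina Casava 1.8+ FASTQ format as the example, each sequence is represented by four lines (the original example showed two sequences):

Line 1: must start with "@", followed by a unique sequence identifier and then an optional description; the identifier and the description are separated by a space.
Line 2: the sequence characters (nucleotide letters for nucleic acids, amino-acid letters for proteins).
Line 3: must start with "+", followed by an optional identifier and an optional description. If anything follows the "+", it must be identical to what follows the "@" on line 1.
Line 4: the quality characters. Each character gives the quality of the base or amino acid at the same position on line 2 and can be converted by a fixed rule into a quality score, which reflects the error rate of that base. This line must contain exactly as many characters as line 2. The relation between characters and error rates is described below.

Within these requirements, sequencer vendors and data repositories define lines 1 and 4 somewhat differently.

Line 1, the identifier line, looks like this in Illumina and NCBI SRA data:
Illumina Casava 1.8+ (see the wiki for a detailed explanation): @HWI-ST1276:97:D1DCYACXX:7:1101:1406:2170 1:N:0:CGACGT
NCBI SRA: @SRR387514.1 ILLUMINA-C4D679_0049_FC:1:12:3317:1141 length=40

The encoding of line 4 was first defined by the developers of the Phred program and is usually called Phred quality. In early versions of the Illumina pipeline (before v1.3) quality was defined differently from Phred, and the line should then be called Solexa quality. From Illumina v1.3 onwards the Phred definition is used.

Where do base quality scores come from?

Phred was originally a program for calling bases from the fluorescence trace data produced by a sequencer. In early dye-based sequencing, each base incorporation releases a fluorescent signal that is captured by a CCD image sensor. Recording the signal peaks produces a real-time trace (chromatogram). Because different bases are labelled with different colours, detecting these peaks identifies the corresponding bases. But the height, density, shape and position of the peaks can be irregular or blurred, so the correct base is sometimes hard to call from the peaks alone.

Figure 1. Example chromatogram.

Phred computes a number of parameters related to peak size and resolution and uses them to look up a base quality score in a large lookup table. The table is derived from sequencing data of known sequences (presumably by relating peak parameters to base error rates and then converting the error rates to quality scores, which gives a direct mapping from peak parameters to scores). Different sequencing chemistries and machines use different lookup tables. To save disk space, a quality score (which could take two characters) is converted by a fixed rule (Phred+33 or Phred+64) into a single character.

The base error probability p and the quality score are related in one of two ways:

Q_phred = -10 * log10(p)
Q_solexa (Illumina pipeline before v1.3) = -10 * log10(p / (1 - p))

Figure 2. Quality score Q against error probability p; red is Phred, black is the early Illumina version; the dashed line marks p = 0.05, where Q ≈ 13.

Besides this difference in how scores relate to error rates, the encodings also differ in how characters are converted to scores.

Figure 3. Quality scores and ASCII values of the quality characters in the different versions.

The ASCII value of a quality character and the quality score are related in one of two ways:

Phred+64: Q = ASCII value of the quality character - 64
Phred+33: Q = ASCII value of the quality character - 33

The encodings can thus be roughly divided into Phred+33 and Phred+64, where 33 and 64 are the numbers subtracted from the ASCII value to get the score.

Some software treats bases differently depending on their quality scores, so the correct encoding often has to be specified, and it is necessary to decide correctly whether a file is Phred+33 or Phred+64. Data produced in the last two years is almost always Phred+33, but older data downloaded from NCBI SRA may not be.

The character ranges used by Phred+33 and Phred+64 in Figure 3 make it possible to infer the encoding of a FASTQ file. Characters with an ASCII value of 58 or less (a quality score of 25 or less) are used only by Phred+33; all characters used by the +64 encodings have an ASCII value of at least 59. Normally, characters with an ASCII value of 75 or more appear only in Phred+64. A program can use this information to decide.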
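The two score definitions and the two character offsets above translate directly into code. The following is a minimal sketch with hypothetical function names, meant only to make the formulas concrete.

```python
import math


def phred_quality(p):
    """Phred definition: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def solexa_quality(p):
    """Early Solexa/Illumina definition: Q = -10 * log10(p / (1 - p))."""
    return -10 * math.log10(p / (1 - p))


def error_probability(q):
    """Invert the Phred definition: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def decode_quality(char, offset=33):
    """Convert one quality character to a score (offset 33 or 64)."""
    return ord(char) - offset


if __name__ == "__main__":
    print(round(phred_quality(0.05), 2))    # the dashed line in Figure 2
    print(round(solexa_quality(0.05), 2))
    print(decode_quality("I"), decode_quality("h", offset=64))
```

The two definitions agree closely for small p and diverge as p grows, which is why the difference matters mainly for low-quality bases.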
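A Perl script that distinguishes Phred+33 from Phred+64 is attached at the end of this post. Its logic is as follows. By default it reads 1000 sequences, and within those 1000 sequences:
1. If at least 2 quality characters have an ASCII value of 58 or less (that is, at least two bases have a score of 25 or less) and no quality character has an ASCII value of 75 or more, the encoding is judged to be Phred+33.
2. If at least 2 quality characters have an ASCII value of 75 or more (that is, at least two bases have a score of 11 or more) and no quality character has an ASCII value of 58 or less, the encoding is judged to be Phred+64.
3. If all quality characters have ASCII values between 59 and 74, the encoding is judged to be probably Phred+33, but testing with more sequences is recommended. This result can arise in two ways: (1) Phred+33 encoding with all base quality scores between 26 and 41; (2) Phred+64 encoding with all base quality scores between -5 and 10. The former is more likely.
4. In any other case, printing the ASCII values of the quality characters and judging manually is recommended.

Corrections to any misunderstanding on my part are welcome.

fastq_phred.pl

References:
1. https://en.wikipedia.org/wiki/FASTQ_format
2. https://en.wikipedia.org/wiki/Phred_quality_score
3. https://en.wikipedia.org/wiki/Phred_base_calling
4. http://maq.sourceforge.net/fastq.shtml
5. http://maq.sourceforge.net/qual.shtml
6. http://supportres.illumina.com/documents/myillumina/a557afc4-bf0e-4dad-9e59-9c740dd1e751/casava_userguide_15011196d.pdf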
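The four decision rules of the script can be sketched in Python as follows. This is not the author's fastq_phred.pl, only a rough re-implementation of the rules as stated; the function names and return labels are hypothetical, and the file reader assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from itertools import islice


def classify_quality_strings(quality_strings):
    """Apply the four rules to an iterable of quality strings."""
    low = high = total = 0
    for qual in quality_strings:
        for ch in qual:
            code = ord(ch)
            total += 1
            if code <= 58:      # only used by Phred+33
                low += 1
            elif code >= 75:    # normally only used by Phred+64
                high += 1
    if total == 0:
        return "unknown"
    if low >= 2 and high == 0:
        return "phred33"            # rule 1
    if high >= 2 and low == 0:
        return "phred64"            # rule 2
    if low == 0 and high == 0:
        return "probably phred33"   # rule 3: test more reads
    return "unknown"                # rule 4: inspect manually


def guess_fastq_encoding(path, max_reads=1000):
    """Classify the first max_reads records of a FASTQ file (optionally gzipped)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The quality string is the 4th line of each record: indices 3, 7, 11, ...
        quality_lines = islice(handle, 3, max_reads * 4, 4)
        return classify_quality_strings(line.rstrip("\r\n") for line in quality_lines)
```

A call such as `guess_fastq_encoding("reads.fq.gz")` (the file name is a placeholder) returns one of the four labels. When the result is "probably phred33" or "unknown", raising `max_reads` is the natural next step, as the rules above suggest.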
Category: Bioinformatics | 37244 views | 1 comment
Quality assessment of NGS data and processing of reads
bigdataage 2014-7-7 14:23
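Reposted from: http://www.hzaumycology.com/chenlianfu_blog/?p=1456 http://blog.csdn.net/shmilyringpull/article/details/9225195

1. NGS data processing strategy for genome and transcriptome sequencing
After receiving data from the sequencing company, the data must first be preprocessed, mainly in two steps:

1.1 QC (quality control of reads)
Quality control means filtering out low-quality reads. Low-quality reads are of the following kinds:
reads containing primer/adaptor sequence
reads containing too many non-ATCG bases (N)
reads in which the proportion of low-quality bases is too high
These reads must be filtered out completely before the data can be used for downstream analysis.

1.2 Trimming reads
This step is not needed for genome assembly. For transcriptome analysis it is required. Reads are trimmed from the 3′ end to control the proportion of low-quality bases in each read. If trimming leaves a read shorter than a given length, the read is discarded entirely.

2. QC software for NGS data

2.1 NGS QC Toolkit
Citation: Patel RK, Jain M (2012). NGS QC Toolkit: A toolkit for quality control of next generation sequencing data. PLoS ONE, 7(2): e30619.
Website: http://www.nipgr.res.in/ngsqctoolkit.html
The unpacked archive contains four folders and a manual in PDF format. The manual gives detailed instructions; the four folders all contain QC programs written in Perl. They are described below in order of importance.

2.1.1 The QC folder contains 5 Perl programs for QC of 454 reads or Illumina reads:
IlluQC.pl: QC of Illumina reads. By default it removes reads containing primer/adaptor sequence and low-quality reads, and produces summary statistics and 6 kinds of plots. By default ('-s' parameter) bases with quality below 20 are low-quality bases; by default ('-l' parameter) reads in which more than 30% of bases are low quality are low-quality reads. Example run:
$ perl $NGSQCHome/QC/IlluQC_PRLL.pl -pe r1.fq r2.fq 2 5 -p 8 -l 70 -s 20
IlluQC_PRLL.pl: much the same as the previous program, with an additional '-c' parameter for parallel computation to speed up the run.
454QC.pl: QC of 454 reads.
454QC_PRLL.pl: the same as the previous program, with an additional '-c' parameter for parallel computation to speed up the run.
454QC_PE.pl: QC of paired-end 454 reads.

2.1.2 The Trimming folder contains 3 programs for trimming reads:
AmbiguityFiltering.pl: trims reads containing non-ATCG bases. It offers 4 mutually exclusive trimming methods: a maximum allowed number of non-ATCG bases; a maximum allowed percentage of non-ATCG bases (example below); trimming N-containing sequence from the 5′ end; trimming N-containing sequence from the 3′ end. A common additional parameter discards reads shorter than a given length.
$ perl $NGSQCHome/Trimming/AmbiguityFiltering.pl -i r1.fq -irev r2.fq -p 2 -n 50
TrimmingReads.pl: offers 3 mutually exclusive trimming methods: trimming a specified number of bases from the 5′ end of all reads; trimming a specified number of bases from the 3′ end of all reads; trimming bases with quality below a specified value from the 3′ end (example below). A common additional parameter discards reads shorter than a given length.
$ perl $NGSQCHome/Trimming/TrimmingReads.pl -i r1.fq -irev r2.fq -q 13 -n 50
HomopolymerTrimming.pl

2.1.3 The Statistics folder contains 2 programs for N50 statistics and similar:
N50Stat.pl: computes the N50 of a FASTA file.
AvgQuality.pl: computes read quality statistics for 454 files.

2.1.4 The Format-converter folder contains programs for converting between file formats. It holds 4 Perl programs: FastqTo454.pl, FastqToFasta.pl, SangerFastqToIlluFastq.pl, and SolexaFastqToIlluFastq.pl.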
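The filtering and 3′ trimming steps described in sections 1.1 and 1.2 can be sketched in a few lines of Python. This is only an illustration of the idea, not how NGS QC Toolkit is implemented; the function names are hypothetical, the default thresholds simply mirror the values quoted above (quality 20 and 30% for filtering, quality 13 and length 50 for trimming), and Phred+33 encoding is assumed.

```python
def is_low_quality(qual, min_q=20, max_low_fraction=0.30, offset=33):
    """True if more than max_low_fraction of the bases score below min_q."""
    if not qual:
        return True
    low = sum(1 for ch in qual if ord(ch) - offset < min_q)
    return low / len(qual) > max_low_fraction


def trim_3prime(seq, qual, min_q=13, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def process_read(seq, qual, filter_q=20, max_low_fraction=0.30,
                 trim_q=13, min_len=50, offset=33):
    """Filter, then trim; return (seq, qual), or None if the read is discarded."""
    if is_low_quality(qual, filter_q, max_low_fraction, offset):
        return None
    seq, qual = trim_3prime(seq, qual, trim_q, offset)
    if len(seq) < min_len:
        return None
    return seq, qual
```

With paired-end data, a read pair would normally be kept only when both mates survive, so that the two files stay in sync; that bookkeeping is left out of the sketch.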
8290 views | 0 comments
off target reads of exome sequencing
Popularity 1 skytnn 2013-11-24 11:07
Finding the lost treasures in exome sequencing data. Volume 29, Issue 10, October 2013, Pages 593–599. Human Genetics
Category: RNA sequencing | 2344 views | 1 comment
phred33 or phred64
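Popularity 1 jiewencai 2013-7-20 00:06
This post has been updated. For a detailed description of the FASTQ format and a Perl script that distinguishes phred33 from phred64, see the new post: http://blog.sciencenet.cn/home.php?mod=space&uid=630246&do=blog&id=813262
Category: Bioinformatics | 12128 views | 2 comments
[Repost] BGI publishes in Genome Research on indel identification: SOAPindel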
bioseq 2012-9-14 14:15
SOAPindel: Efficient identification of indels from short paired reads
Shengting Li, Ruiqiang Li, Heng Li, Jianliang Lu, Yingrui Li, Lars Bolund, Mikkel Schierup, and Jun Wang
Genome Res. published 12 September 2012; doi:10.1101/gr.132480.111
http://genome.cshlp.org/content/early/2012/09/12/gr.132480.111.abstract.html
We present a new approach to indel calling which explicitly exploits that indel differences between a reference and a sequenced sample make the mapping of reads less efficient. We assign all unmapped reads with a mapped partner to their expected genomic positions and then perform extensive de novo assembly on the regions with many unmapped reads to resolve homozygous, heterozygous and complex indels by exhaustive traversal of the de Bruijn graph. The method is implemented in the software SOAPindel and provides a list of candidate indels with quality scores. We compare SOAPindel to Dindel, Pindel and GATK on simulated data and find similar or better performance for short indels (<10 bp) and higher sensitivity and specificity for long indels. A validation experiment suggests that SOAPindel has a false positive rate around 10% for long indels (>5 bp) while still providing many more candidate indels than other approaches.
2530 views | 0 comments
