科学网 › 标签 › ngs

标签: ngs

相关帖子	版块	作者	回复/查看	最后发表

没有相关内容

相关日志

基于NGS的产前诊断技术原理: 热度 1 xbinbzy 2015-12-21 17:02; 目前基于新一代测序技术（Next Generation Sequencing, NGS）在临床充分利用的产前筛查方法（或者叫唐氏筛查）的原理最早在PNAS的两篇文章，通在2008年上发表，相关的数据处理策略略有差别，两篇文章分别是 “Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma” 和“Noninvasive diagnosis of fetal aneuploidy by shotgunsequencing DNA from maternal blood“，两个团队基于各自的技术手段，先后都成立了提供唐氏筛查的技术服务公司。文章 Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma中的数据处理原理如下：第一步： Fetal DNA (thick red fragments) circulates inmaternal plasma as a minor population among a high background of maternal DNA (black fragments). A sample containing a representative profile of DNA molecules in maternal plasma is obtained. 在母体血液中有游离的胎儿DNA片段，在上述途中，红色的片段代表胎儿DNA片段。从图上可以看出，母体血液中胎儿DNA的浓度非常低，为此科研工作者的目的在于用数据处理的方法能够把信息放大到准确检测的地步。第二步： In this study, one end of eachplasma DNA molecule was sequenced for 36 bp using the Solexa sequencing-by-synthesis approach. 当时用Solexa测序技术测取36bp的reads，目前主要基于Illumina、CG和Proton的测序技术，读长和深度都有所增加。第三步： The chromosomal origin of each 36-bp sequence was identified through mapping to the human reference genome by bioinformatics analysis.将测取到的reads进行mapping操作，比对到人的参考序列上。第四步： The number of unique sequences mapped to each chromosome was counted and then expressed as a percentage of all unique sequences generated for the sample, termed % chrN for chromosome N. 统计每条染色体上uniq比对的reads数目，同时计算每条染色体上uniq reads的比例。第五步：Z-scores for each chromosome and each test sample were calculated using the formula shown. 计算一个z－score值，利用染色体上uniq reads的比例减去control样本中对应此条染色体uniq reads比例的平均值，再除以control样本中此条染色体uniq reads的标准差。例如，21三体筛查，假设有100个已知怀正常胎儿的样本，通过前4步的计算，可以得到每个样本在21号染色体上，uniq reads的比例具体值，如此可以计算这100个样本在21号染色体的平均值和标准差。如果1个新的样本要判断是不是21三体，可以根据上述公式去计算21号染色体的z-score，超过界定界限则是患病胎儿，反之正常。其中就可以看到，正常样本数越多，平均值和标准差就越准（基线越来越标准），如此得到的z-score更准确。通过以上方法最终的结果如下：从上图可以看出，如果利用染色体上的uniq reads比例不能很好区分正常胎儿和3体胎儿，利用z-score的计算，可以发现3体胎儿的z-score超过3，正常胎儿都位于3以上，为此达到很理想的检测目的。文章 Noninvasive diagnosis of fetal aneuploidy by shotgunsequencing DNA from maternal blood 中的数据处理原理和结果如下：第一步：We obtained on average 􏰟10 million 25-bp sequence tags per sample. 测序获取数据；第二步：将每条染色体按50kb划分bin，根据bin中uniq reads的数目进行排序，取中位数代表此条染色体的reads数目；第三步：每个样本有21条染色体的reads数目，再对21个数据排序，取中位数代表此样本的reads数目；第四步：用每条染色体的reads数目除以每个样本的reads数目得到归一化的一个值；第五部：利用第四步得到的诡异化值去计算置信区间，当样本的值不在此置信区间时，被判定为异常样本。利用此处理策略，可发现针对21三体，能够较理想区分正常胎儿和异常胎儿样本。此处理策略也依赖于正常样本的数目，正常样本越多，计算得到的置信区间会越准确，判断的结果也就会越准。参考文章： http://www.pnas.org/content/early/2008/10/03/0808319105 http://www.pnas.org/content/105/51/20458.abstract; 个人分类: 科研文章|6173 次阅读|1 个评论

[转载]HOMER_Software for motif discovery and next-gen sequencing: ab513467348 2014-9-28 08:35; HOMER Software for motif discovery and next-gen sequencing analysis Mapping reads to the genome Once you have checked your FASTQ files and have removed all adapter sequences that might be present, you are ready to map them to a reference genome. While tools like BLAST and BLAT are powerful methods, they are not specialized for the vast amount of data generated by next-generation sequencers. It is highly recommended that you use a next-gen specific read alignment program. Note: While BWA, Bowtie, and Tophat have received the most attention as short read alignment algorithms, new methods such as STAR are significantly faster and in some cases more accurate. More like it are likely to come along soon (if not already available...) Selecting a reference genome Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a genome): Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it? Use that one. Do any of your collaborators have a favorite genome/version? Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser . (i.e. mm9, hg19) Mapping to a normal genomic reference You want to map your reads directly to the genome if you are performing: ChIP-Seq GRO-Seq DNase-Seq MNase-Seq Hi-C Popular short read aligners Full List Most Popular: bowtie : fast, works well bowtie2 : fast, can perform local alignments too BWA - Fast, allows indels, commonly used for genome/exome resequencing Subread - Very fast, (also does splice alignment) STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory) To be honest, I would probably recommend STAR for almost any application at this point if you have the memory ( see below ) Example of alignment with bowtie2: Step 1 - Build Index (takes a while, but only do this once): After installing bowtie2 , the reference genome must first be indexed so that reads may be quickly aligned. You can download pre-made indices from the bowtie website (check for those here first). Please be aware that bowtie2 indexes are different than bowtie indexes. To perform make your own from FASTA files, do the following: Download FASTA files for the unmasked genome of interest if you haven't already (i.e. from UCSC ). Do NOT use masked sequences. I also tend to remove the *_random.fa chromosomes. These often contain part of the canonical chromosomes in addition to regions that cannot be placed in the assembly. The problem with these regions is that the part shared with the canonical chromosome will be present twice, making it difficult to map the reads to a unique location. Concatenate FASTA files into a single file. We can do this using the UNIX cat command, which merges files together cat *.fa genome.fa From the directory containing the genome.fa file, run the bowtie2-build command. The default options usually work well for most genomes. For example, for hg19: /path-to-bowtie-programs/bowtie2-build genome.fa hg19 This command will create 6 files with a *.bt2 file extension. These will then be used by bowtie2 or newer versions of tophat to map data. Copy the *.bt2 files to the bowtie2 indexes directory (or somewhere you can store them) so that bowtie2 knows where to find them later: cp *.bt2 /path-to-bowtie-programs/indexes/ Step 2 - Align sequences with bowtie (perform for each experiment): The most common output format for high-throughput sequencing is FASTQ format , which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made. In the case of Illumina sequencing, the output is usually a s_1_sequence.txt file. More recently the Illumina pipeline will output a file that is debarcoded with your sample name such as Experiment1.fastq. In addition, much of the data available in the SRA , the primary archive of high-throughput sequencing data, is in this format. To use bowtie2 to map this data, run the following command: /path-to-bowtie-programs/bowtie2 -p # cpu -x genome index prefix fastq file output filename i.e. /programs/bowtie2 -p 8 -x hg19 Experiment1.fastq Experiment1.sam Where genome index prefix is the common prefix for the *.bt2 files that were created using the bowtie2-build command in step 1, or from a downloaded index. If the *.bt2 files are stored int the /path-to-bowtie2-program/indexes/ directory, you only need to specify the name of the index. If the index files are located elsewhere, you can specify the full path names of the index files (in the examples above this would be -x /programs/indexes/hg19). In the example above, we use 8 processors/threads. The bowtie2 program is very parallel in nature, with near linear speed up with additional processors. The default output is a SAM file. To learn more about SAM alignment files, go to the next section on SAM/BAM files. There are many other options for bowtie2 that may be important for your study, but most ChIP-Seq data can be mapped using the default options. NOTE : Usually, the process of removing duplicate reads or removing non-unique alignments is handled by the downstream analysis program. Example of alignment with bwa: Step 1 - Build Index (takes a while, but only do this once): bwa index -a bwtsw genome.fa Step 2 - Align reads to the index (perform for each experiment): # where the genome.fa is in the same directory with your index from the first step bwa mem -t #cpus genome.fa reads.fq aln-se.sa #paired end bwa mem -t #cpus genome.fa reads1.fq reads2.fq aln-pe.sam Mapping to a genome while allowing splicing Usually, any kind of RNA-seq method will benefit from looking for splicing junctions in addition to genomic mapping: RNA-Seq (polyA+, total) CLIP-Seq/PAR-CLIP RIP-Seq ChIRP-Seq Ribo-Seq Popular splice read aligners Tophat (most popular) Subread - Very fast, (also does splice alignment) STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory) lots of others... I would probably recommend STAR for RNA-Seq is you have enough RAM ( see below ) Example of aligning RNA-Seq data with STAR (very very fast) STAR is one of a growing number of short read aligners that takes advantage of advances in computational power to optimize the short read mapping process (original publication: Dobin et al. ) The key limitation with STAR is computer RAM - STAR requires at least 30 Gb to align to the human or mouse genomes. To install STAR, visit the website and follow their instructions. Step 1 - Build a genome index Like all aligners, you need to build the genome index first. The STAR website has links to the hg19 genome index if you want to skip this step. First, make a directory for the index (i.e. mm9-starIndex/). Then copy the genome FASTA file it the directory and cd into it to make that directory your current directory. Then, to build the index, the command is: STAR --runMode genomeGenerate --runThreadN # cpus --genomeDir genome output directory --genomeFastaFiles input Genome FASTA file i.e. STAR --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --genomeFastaFiles mm9.fa Note: For small genomes, you may need to add the following to create a proper index: --genomeSAindexNbases 6 Step 2 - Align RNA-Seq Reads to the genome with STAR To align RNA-Seq reads with STAR, run the following command: STAR --genomeDir Directory with the Genome Index --runThreadN # cpus --readFilesIn FASTQ file --outFileNamePrefix OutputPrefix i.e. STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn Experiment1.fastq --outFileNamePrefix Experiment1Star # paried-end data: STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn read1.fastq read2.fastq --outFileNamePrefix Experiment1Star STAR will create several output files - the most important of which is the *.Aligned.out.sam - you will likely want to convert this to a bam file and sort it to use it with other programs. The default output is a SAM file. To learn more about SAM alignment files, go to the next section on SAM/BAM files. Notes on STAR STAR is very very fast - it will rip through 20 million reads in a matter of minutes if you have 20 cpus working for you. In fact, the longest part of the program is loading the index into memory. If you are aligning several experiments in a row, add the option --genomeLoad LoadAndKeep and STAR will load the genome index into shared memory so that it can use it for the next time you run the program. Also, converting the sam file into a sorted bam file will take much longer than aligning the data in the first place. Example of Alignment with Tophat (not recommended) Tophat is basically a specialized wrapper for bowtie2 - it manipulates your reads and aligns them with bowtie2 in order to identify novel splice junctions. It can also use given splice junctions/gene models to map your data across known splice junctions. Step 1 - Build Index (takes a while, but only do this once): This part is exactly the same as for bowtie2 - if you already made or downloaded an index for bowtie2, you can skip this step. Step 2 - Align your RNA-seq data to the genome using Tophat To use tophat to map this data, run the following command: /path-to-bowtie-programs/tophat -o output directory -p # cpu /path-to-genome-index/prefix fastq file For example: /programs/tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq Paired-end Example: /programs/tophat -o TophatOutputPE/ -p 8 /programs/indexes/hg19 Experiment1.r1.fastq Experiment1.r2.fastq In the example above, we use 8 processors/threads. The tophat2 program contains a mix of serial and parallel routines, so more processors doesn't necessarily mean it will be finished much faster. In particular, if you are performing coverage based searchers, it may take a long time (multiple processors will not help). In general, if you have multiple samples to map, it's better to map them at the same time with separate commands rather than mapping them one at a time with mapping processors. Tophat places several output files in an output directory. The most important is the accepted_hits.bam file - this is the alignment file that will be used for future analysis (more info here ). There are additional files that can be useful, including a junctions.bed file with records all of the splice junctions discovered in the data. Important Tophat Parameters: --library-type fr-unstranded | fr-firststrand | fr-secondstrand Describes which method was used to generate your RNA-seq library. If you used a method that doesn't produce strand specific signal (i.e. just sequencing a cDNA library), then select the fr-unstranded. If you use a stranded method that sequences the first DNA strand made (like a dUTP method), then use fr-firststrnad. If you use direct ligation methods, then fr-secondstrand. Correctly specifying the library type will help Tophat identify splice junctions. -G GTF file Use this option to specify a known transcriptome to map the reads against. By default, tophat will also search for de novo splice events, but this will help it known were the known ones are so that you don't miss them. A GTF files are called Gene Transfer Files, and a description of the format can be found here . To get a GTF file for your organism, you can usually get one from UCSC Table Browser: In the output format, be sure to select GTF file - the file you download from here should work with tophat. Tophat Mapping Strategy If your goal is to identify novel transcripts and you have several separate experiments, I would recommend pooling all of your data together into a single expeirment/FASTQ file and mapping your data in one run. One of the ways Tophat tries to identify novel junctions is by first identifying exons by mapping segments of reads to the genome using bowtie2. The more segments it's able to map, the more confident it is about putative exons and the greater the chance it will identify unannotated splice sites. You can then go back with your novel splice sites and remap the original experiments (not pooled) to get reads for each individual experiment using the -j raw junction file . This is most useful for short reads. This general strategy is also useful if running cufflinks... Mapping bisulfite-treated DNA MethylC-Seq, BS-Seq, or RBBS-Seq data requires a special mapping strategy. In these experiments the genomic DNA is bisulfite treated, causing all unmethylated cytosines to be converted to uracil, which will utimately show up as thymine after sequencing. This is a clever way to figure out which cytosines are methylated in the genome, but requires a clever mapping strategy to avoid bias detection. I'll try to include an example of mapping this type of data in the near future, for now consider this list of BS-aligners . De novo Assembly Sometimes it makes more sense to perform de novo assembly instead of mapping reads to a reference genome. Assembly is well beyond the scope of this tutorial. Genomics assembly from short reads: Velvet , SOAPdenovo Transcript assembly: Trinity Can't figure something out? Questions, comments, concerns, or other feedback: cbenner@salk.edu http://homer.salk.edu/homer/basicTutorial/mapping.html; 个人分类: 科研笔记|4820 次阅读|0 个评论

NGS 测序过程中的相关术语: chenxinzzbin 2014-7-25 16:47; SBS：边合成边测序反应，每次SBS会延伸一个碱基，大约耗时70分钟。 Run：单次上机测序反应，可以产生4G-75G测序通量不等。 Lane：单泳道，每条泳道可以直接物理区分测序样品，1次run最多可以同时上样8条Lane。 Channel： Lane的同义词。 Tile：小区，每条Lane中排有2列tile，合计120个小区。每个小区上分布数目繁多的簇结合位点。 Cluster：簇，在 Solexa 测序技术中会采用桥式PCR方式生产DNA簇，每个DNA簇才能产生亮度达到CCD可以分辨的荧光点。 Index：标签，在 Solexa 多重测序（Multiplexed Sequencing）过程中会使用Index来区分样品，并在常规测序完成后，针对Index部分额外进行7个循环的测序，通过Index的识别，可以在1条Lane中区分12种不同的样品。 Barcode: Index同义词 Fasta：一种序列存储格式。一个序列文件若以FASTA格式存储，则每一条序列的第一行以“”开头，而跟随“”的是序列的ID号（即唯一的标识符）及对该序列的描述信息；第二行开始是序列内容，序列短于61nt的，则一行排列完；序列长于61nt的，则每行存储61nt，最后剩下小于61nt的，在最后一行排列完；第二条序列另起一行，仍然由“”和序列的ID号开始，以此类推。 Fastq： Fastq是 Solexa 测序技术中一种反映测序序列的碱基质量的文件格式。第一行以“@”符号开头，后面紧跟一个序列的描述信息；第二行是该序列的内容；第三行以“+”符号开头，后面紧跟的内容与第一行一样，同样是该序列的描述信息；而第四行是第二行中的序列内容每个碱基所对应的测序质量值。 PF% ：PF%是指符合测序质量标准的簇的百分比（Multiplexed Sequencing），与测序的通量相关联。 Read： Solexa 是成簇反应的，每个簇对应一条DNA序列片段，成为一个read。转自： http://www.plob.org/2012/12/21/5166.html; 7030 次阅读|0 个评论

NGS数据的质量评估和reads的处理: bigdataage 2014-7-7 14:23; NGS数据的质量评估和reads的处理转自： http://www.hzaumycology.com/chenlianfu_blog/?p=1456 http://blog.csdn.net/shmilyringpull/article/details/9225195 1. 基因组测序和转录测序的NGS数据处理策略从测序公司拿到数据后，首先需要对数据进行预处理，主要分两步走： 1.1 QC(reads的质量控制) Quality Control, 即过滤低质量reads, 低质量的reads有如下几种：含有Primer/Adaptor的reads 含有过多non-ATCG碱基N的reads 测序质量较低的碱基数占的比例过高的reads 需要将这些reads完全过滤掉，才能用于下一步的分析。 1.2 对reads进行trim处理如果进行基因组组装，则不需要进行该步骤。如果是需要进行转录组的分析，则必须要该步骤。本步骤从3′端来对reads进行trim，来控制reads中低质量碱基的比例。直到trim的read长度低于一定的数时，则完全舍弃该read。 2. NGS数据的QC软件 2.1 NGSQC toolkit 该软件的citation: Patel RK, Jain M (2012). NGS QC Toolkit: A toolkit for quality control of next generation sequencing data. PLoS ONE, 7(2): e30619. 该软件的官网： http://www.nipgr.res.in/ngsqctoolkit.html 该软件解压缩后包括4个文件夹和1个PDF格式的manual文件。manual文件是详细的说明；4个文件夹中都是使用perl编写的用于QC的程序。按其重要程度决定先后，其介绍如下： 2.1.1 QC文件夹中包含了4支PERL程序，用于454 READS或ILLUMINA READS的QC，分别为： IlluQC.pl 用于Illumina reads的QC。默认情况下去除掉含有primer/adaptor的reads和低质量的reads，并给出统计结果和6种图形结果。默认设置 (‘-s’ 参数) 碱基质量低于20的为低质量碱基；默认设置 ( ‘-l’ 参数)低质量碱基在reads中比例 30% 的为低质量reads。程序运行例子： $ perl $NGSQCHome/QC/IlluQC_PRLL.pl -pe r1.fq r2.fq 2 5 -p 8 -l 70 -s 20 IlluQC_PRLL.pl 和上一个程序没有多大区别，只是多了 ‘-c’ 参数来进行并行计算，增加程序速度。 454QC.pl 对454 reads进行QC。 454QC_PRLL.pl 和上一个程序一眼个，只是多了 ‘-c’ 参数来进行并行计算，增加程序速度。 454QC_PE.pl 对paired-end测序的454 reads进行QC。 2.1.2 TRIMINGREADS文件夹包含3支程序，用于READS的TRIMMING，分别为： AmbiguityFiltering.pl 对含有non-ATCG的reads进行trimming的程序。有4种(4选1)trim方法：允许最大non-ATCG数目；允许最大的non-ATCG比例(例子如下)；从5′端trim掉含N的序列；从3′端trim掉含N的序列。加上个通用的参数：低于一定长度的reads被cutoff掉。 $ perl $NGSQCHome/Trimming/AmbiguityFiltering.pl -i r1.fq -irev r2.fq -p 2 -n 50 TrimmingReads.pl 有3种(3选1)trim方法：对所有read从5′端trim掉制定数目的碱基；对所有reads从3′端trim掉指定数目的碱基；从3′端trim掉质量低于指定值的碱基(例子如下）。加上个通用的参数：低于一定长度的reads被cutoff掉。 $ perl $NGSQCHome/Trimming/TrimmingReads.pl -i r1.fq -irev r2.fq -q 13 -n 50 HomopolymerTrimming.pl 2.1.3 STATISTICS文件夹中2支程序，用于进行N50统计等 N50Stat.pl 用于统计fasta文件的N50 AvgQuality.pl 用于统计454文件的reads质量 2.1.4 FORMT-CONVERTER文件夹中程序运用于不同格式文件的转换，其中含有4个PERL程序，分别为： FastqTo454.pl、FastqToFasta.pl、SangerFastqToIlluFastq.pl、SolexaFastqToIlluFastq.pl。; 8290 次阅读|0 个评论

[转载]美国奥本大学沟鲶BSR-Seq文章: jackiehu 2014-3-17 10:14; BMC Genomics. 2013 Dec 30;14(1):929. Bulk segregant RNA-seq reveals expression and positional candidate genes and allele-specific expression for disease resistance against enteric septicemia of catfish. 关键词：BSR-Seq, Bulked segregant RNA-seq. 分离群体分组转录组测序，联合分离群体分组分析(BSA)和基于NGS的转录组测序(RNA-seq)技术。物种与疾病：ESC, enteric septicemia of catfish （斑点叉尾鱼回肠型败血症)。斑点叉尾鮰，又称沟鲶、钳鱼，原产于北美洲，一种大型淡水鱼类。实验材料：4个BC1家系(抗病X感病F1与感病亲本回交)，每条鱼35g左右。对照群体：每个家系100条，共400条鱼。处理群体：每个家系300条，共1200条鱼，注入1000ml爱德华氏细菌(4×108CFU/ml)。取材：3-5天后，从2个家系群中收集死去个体为敏感鱼。2周后，同样的2个家系群中收集所有活着的个体为抗病鱼。于此同时，收集对照群体的鱼。测序方案：抗性组、敏感组、和对照组；每个组每个家系选12条，共24条鱼。3组共72条鱼，每条鱼取相同重量的肝脏组织，抽提RNA。每个组的个体所提出的RNA，等量混合，Truseq文库构建，PE100转录组测序。 Trinity软件进行de novo组装。转录组测序数据量为：抗性组，151M reads；敏感组，116M reads；对照组，132M reads。共得clead reads为374 M，de novo组装成232338条非冗余contigs，均长825bp。其中5.1万多条长度大于1Kb。基因差异表达分析发现，抗性组比较对照组时，224个基因差异表达，其中130个上调。而差异倍数大于10倍的基因有42个。敏感组对比对照组时，总计有1240个基因差异表达，其中差异倍数在10倍以上的有233个。抗性组与敏感组比较时，差异表达基因有1255个，在抗性组中上调表达的528个基因中，有4个差异倍数大于100；19个差异倍数在50-100之间；86个差异倍数在10-50之间。在抗性组中表达降低的基因中，2个超过100倍；10个在50-100之间；86个在10-50之间。使用Popoolation软件分析，鉴定出513371个SNPs；使用VarScan软件，鉴定到482035个SNPs，其中两个软件共同鉴定到465537个SNPs，这些SNPs位于31646个contigs之中。位于11249个contigs之中的56419个显著SNPs在抗性组和敏感组之间具有显著的等位频率差异。而这11249个contigs中，5480个可以比对上已知基因，代表了4304个unique基因。分析抗性组和敏感组的RNA-seq数据，得到分离群体频率比（Bulk frequency ratios,BFR）。大量含有标志SNPs的基因的BFR值大于2；总计359个基因的BFR值等于大于4；其中337个基因的BFR值在4-16之间；23个基因大于16；还有4个基因的BFR大于32. BFR值大于等于4的359个基因，所含有的组合分离等位比率在7以下，其中大都是具有1-3个组合分离等位比率，这表明BFR值大的这些基因并不是因为等位基因特异性表达，而可能是遗传分离所到导致的。共有354个BFR值大的基因被鉴定出含有显著SNPs，BLASTN搜索鲶鱼基因组草图，这354个基因在201个scaffolds之中，其中134个被定位到连锁群上。共有354个BFR值大的基因被鉴定出含有显著SNPs，BLASTN鲶鱼基因组草图（未发表），发现这些基因位于201个scaffolds之中，其中134个scaffolds已被定位到连锁群上。鉴定到8条染色体上含有沟鲶抗病性QTLs (基因数目5或含有BFR10的基因)， LG6、15和17上基因数目最多，而6条染色体含有BFR值10的基因。一般来说，蓝鲶的ESC抗病性要强于沟鲶。本研究98个SNPs组合等位比率大于14的基因中， 18个基因的亲本起源已知，11个基因起源自斑点叉尾鮰/沟鲶（channel catfish）， 7个基因起源自长鳍叉尾鮰/蓝鲶（blue catfish）。 11个优先表达沟鲶等位基因的基因中，6个在抗性鱼中高表达，5个在敏感鱼中高表达。而7个优先表达蓝鲶等位基因的基因中，4个在抗性鱼中高表达，3个在敏感鱼中高表达。本文经以上分析共找出17个同时具有高BFR值和低等位基因比的差异表达基因，这17个基因是候选抗病关键基因。本文转自：http://www.bgitechsolutions.cn/bbs/forum.php?mod=viewthreadtid=545; 个人分类: 遗传|3822 次阅读|0 个评论

ngs read error correction: zoubinbin100 2014-3-13 09:11; 今天浏览网站（ http://www.biodiscover.com/group/topic/5604.html ）时发现了 error correction，对于error correction我不是太了解，以前看到的很多RNA-Seq中并没有提到这一点，但是看到网站上面有很多的软件，觉得这也是个数据分析重要的内容，所以今天做个记录，希望以后能抽出时间去研究。; 个人分类: 待研究-NGS|918 次阅读|0 个评论

National Geodetic Survey (NGS): Master123 2013-4-19 20:08; http://www.ngs.noaa.gov/; 个人分类: Research institutions|2510 次阅读|0 个评论

[转载]QIAGEN‘s Clinical NGS effort: genesquared 2012-11-21 09:52; QIAGEN unveils initiative to create next-generation sequencing portfolio for use in clinical research and molecular diagnostics http://www.qiagen.com/about/pressreleases/pressreleaseview.aspx?PressReleaseID=384 Aim to expand next-generation sequencing beyond current focus on life sciences research QIAGEN plans to offer sample-to-result workflows that integrate its sample preparation and assay products with a next-generation benchtop sequencer and new bioinformatics Initiative combines broad range of QIAGEN products with acquisition of sequencing specialist Intelligent Bio-Systems, Inc. and a new strategic collaboration with SAP AG HILDEN, Germany, and GERMANTOWN, Maryland, USA, June 25, 2012 - QIAGEN N.V. (NASDAQ: QGEN; Frankfurt Prime Standard: QIA) today unveiled an advanced initiative to enter the field of next-generation sequencing (NGS) that aims to establish these technologies as routine processes used in new areas such as clinical research and molecular diagnostics . QIAGEN is in the advanced stages of creating sample-to-result, efficient and cost-effective NGS workflow solutions. These will combine a broad range of QIAGEN products - including automated sample preparation solutions ( nucleic acid extraction, DNA enrichment, library preparation and targeted gene analysis panels) - with a previously undisclosed next-generation benchtop sequencer in development with Intelligent Bio-Systems, Inc., a privately held company that QIAGEN has acquired. New bioinformatics, including solutions emerging from a new collaboration with SAP AG, will be incorporated into the workflows. A first sample-to-result NGS solution is expected to be launched next year, while details on specifications and launch plans are set to be released in early 2013. "The rapid advances in next-generation sequencing have enabled life science researchers to unlock many secrets about the molecular building blocks of life. Our ambition is to create a new dimension of benefits for these technologies by offering workflow solutions for clinical use, particularly to develop new medicines and improve healthcare with advanced diagnostics," said Peer M. Schatz, CEO of QIAGEN N.V. "While next-generation sequencing is viewed today mainly as a research tool, our initiative is to expand beyond this and to offer applications designed to address the needs of customers in clinical research and molecular diagnostics ." Key elements of this initiative include: Content: The development of a broad portfolio of gene panels designed for NGS analysis based on QIAGEN's extensive offering of molecular content, including GeneGlobe (www.geneglobe.com), an online portal that offers access to more than 60,000 well-defined and characterized molecular assays. In the first wave, QIAGEN plans to offer eight preconfigured gene panels for use in cancer, as well as enable customers to create customized panels for specific molecular pathways and diseases. Sample technologies: An extensive range of NGS sample preparation products is planned to be launched that are based on QIAGEN's global leadership position in sample technologies and enzymology. NGS module: A previously undisclosed NGS benchtop sequencer is in late stages of development with Intelligent Bio-Systems, Inc. (IBS), a privately held U.S. company that QIAGEN has acquired. This novel system can process multiple samples in parallel with highest flexibility and performance, and benefits from the use of proprietary sequencing by synthesis (SBS) technology exclusively licensed from Columbia University. Building on elements of previous IBS designs as well as on QIAGEN technologies, this new system - which is expected to enter beta testing with customers in 2012 - seeks to offer a new dimension of benefits and cost savings. Key features include new sample technologies and software as well as the ability to process up to 20 individual samples in parallel without a need for pooling and bar-coding, which can result in significant time and cost savings in clinical sequencing . Its design allows for flow cells and reagents to be loaded continuously while in operation, and also for up to 20 different assay types to be processed simultaneously and in random order. Many of these features - particularly the parallel processing of multiple samples and continuous loading of reagents and random samples - are considered essential for clinical sequencing . Two automation alternatives are being developed in combination with QIAGEN platforms to create workflow solutions from biological sample through to final result. One workflow integrates the NGS module into the QIAsymphony automation family, while a second is based on the QIAcube automated sample preparation system. Both workflows will offer extensive bioinformatics, including from the SAP collaboration. Financial terms of the IBS acquisition, which was completed during 2012, were not disclosed. "We are very excited to join forces with QIAGEN, which is an ideal partner to bring our new ultra-low cost sequencing technologies to the market as part of a complete workflow that will expand our product's use into new areas," said Steven J. Gordon, Ph.D., CEO and founder of Intelligent Bio-Systems. "We can deliver greater value to our customers from our novel technologies by leveraging QIAGEN's leadership in sample preparation, advanced gene panels and global reach along with the bioinformatics expected to emerge from the collaboration with SAP. Our goal is to better address the demands of clinical and core lab customers for complete solutions, since many have been struggling to adapt existing sequencing platforms to their workflows." Bioinformatics: SAP and QIAGEN are collaborating on bioinformatics efforts aimed at significantly reducing the time required for the analysis of sequencing data. The basis for the collaboration will be to apply the breakthrough SAP HANA ® platform in next-generation sequencing interpretation. Reducing this time period is seen as an important factor in driving greater use of sequencing technologies in new areas and reducing overall operating costs. "Rapid and accurate sequencing , assembly and interpretation of genomes represents a great challenge of our times in healthcare," said Dr. Vishal Sikka, member, Executive Board, SAP AG. "SAP HANA brings a dramatic acceleration to the data analysis challenges at the heart of genomics . We at SAP are very excited to collaborate with QIAGEN on this extraordinary opportunity to transform the biological sciences and help improve people's lives." QIAGEN intends to offer this new product portfolio across all of its customer classes, with priority focus on clinical research in Academia and Pharma as well as in some Molecular Diagnostics franchises, including select areas of Personalized Healthcare. The next-generation sequencing market, which up to this point has been driven primarily by life sciences research, is estimated to be more than $1 billion a year and growing rapidly as the use of these technologies expands into new areas. QIAGEN is a pioneer in enabling the use of biomarker data to guide treatment decisions through companion diagnostics based on real-time PCR technologies as well as a portfolio of sequencing -based diagnostic assays based on its Pyrosequencing ® technology. Together with its portfolio of next-generation sequencing technologies, QIAGEN can offer customers workflows that include the widest range of sample technologies to collect and process nucleic acid samples as well as leading assay and data analysis technologies for use across its customer classes in Academia, Pharma, Applied Testing and Molecular Diagnostics . The adoption of next-generation sequencing in clinical research and molecular diagnostics is still hampered by workflow challenges, particularly time required for data analysis as well as regulatory uncertainties and sequencing costs. The various features in this initiative seek to address these challenges, and could lead to NGS technologies being adopted in certain areas such as exploratory diagnostics, the diagnosis of complex diseases and treatment of cancer patients. NGS technologies are also expected to complement established routine molecular technologies such as real-time PCR . QIAGEN currently anticipates the investments planned to create this new NGS portfolio to be dilutive to adjusted EPS (earnings per share) by approximately $0.01 for full-year results in 2012 and by approximately $0.02 in 2013, but to be accretive to full-year results in 2014. About QIAGEN: QIAGEN N.V., a Netherlands holding company, is the leading global provider of Sample Assay Technologies that are used to transform biological materials into valuable molecular information. Sample technologies are used to isolate and process DNA , RNA and proteins from biological samples such as blood or tissue. Assay technologies are then used to make these isolated biomolecules visible and ready for interpretation. QIAGEN markets more than 500 products around the world, selling both consumable kits and automation systems to customers through four customer classes: Molecular Diagnostics (human healthcare), Applied Testing (forensics, veterinary testing and food safety), Pharma (pharmaceutical and biotechnology companies) and Academia (life sciences research). As of March 31, 2012, QIAGEN employed approximately 3,900 people in more than 35 locations worldwide. Further information can be found at http://www.qiagen.com/ . About Intelligent Bio-Systems, Inc.: Intelligent Bio-Systems, Inc. was founded in 2005 by Dr. Steven Gordon and Dr. Jingyue Ju to commercialize advanced sequencing technologies and patents from Columbia University. Next-generation sequencing refers to methods of analyzing DNA that began to emerge in the 1990s to improve upon earlier sequencing approaches that were time-consuming and laborious. Intelligent Bio-Systems' newer generation of sequencing by synthesis (SBS) technology incorporates proprietary advances in DNA sequence readout and DNA sample preparation, enabling low-cost, high throughput sequencing with high-quality data, making it attractive for customers.; 个人分类: Qiagen|1724 次阅读|0 个评论

[转载]Duplex-sequencing method better cancer detection and treatme: genesquared 2012-10-4 01:16; September 28, 2012 Duplex-sequencing method could lead to better cancer detection and treatment By Bobbi Nodell Posted under: Health and Medicine , News Releases , Research , Science , Technology Michael Schmitt and Jesse Salk came up with the idea for the duplex-sequencing method while on a flight to an ice climbing expedition. During an ice climbing trip to the Canadian Rockies last Christmas, two young researchers from the UW, Michael Schmitt and Jesse Salk, talked about a simple but powerful idea to get better results when looking at cancer cells. If they could reduce the error rate in DNA sequencing, then researchers could better pinpoint which cells are mutating. This improvement could lead to early diagnosis of cancer and a better treatment plan once researchers knew which cells were resistant to chemotherapy. The idea was to sequence both strands of DNA. If they saw a mutation in one strand but not the other, they would recognize it as error from sequencing and not a true mutation. Their results, published online in the Proceedings of the National Academy of Sciences in August, are getting wide accolades. “If its power is confirmed, duplex sequencing will likely improve our understanding of the clonal substructure of human cancers, modify the catalog of rare mutations, help to pinpoint mechanisms of mutation generation, and potentially identify mutator phenotypes,” wrote Christopher Klein, with the Experimental Medicine and Therapy Research group at the University of Regensburg, in an accompanying editorial. “Eventually, it may open doors to clinical applications in which diagnostic accuracy is the sine qua non for ethical treatment decisions. Mary Levin Some of the members of the Loeb lab who worked on the duplex sequencing: Dr. Lawrence Loeb, Dr. Scott Kennedy, Dr. Michael Schmitt, and Dr. Jesse Salk. Working in Dr. Lawrence Loeb’s cancer research laboratory at the UW, the scientists demonstrated an error rate of less than one mistake per half a million nucleotides sequenced and a theoretical limit of less than one error per billion nucleotides. In contrast, the standard method yielded one error for every 200 nucleotides. “Based on our estimates, we appear to have improved the accuracy of sequencing by 10 million-fold or more,” said Dr. Michael Schmitt, the lead author of the paper, “Detection of ultra-rare mutations by next-generation sequencing.” Schmitt, a postdoctoral fellow in the Loeb lab, added, “This theoretically makes it possible to sequence the entire genome of a cell without a single error.” The genome of a cell is 6 billion nucleotides. The UW researchers are eager to use this approach to test Loeb’s mutator phenotype hypothesis proposed in 1974. The idea, considered radical at the time, surmises that the mutation rate in the early stages of tumor development is greater than that of normal cells. This could explain how some cancers spread and become resistant to chemotherapies so quickly. If researchers could see which cells are mutating faster, then they could potentially detect cancer cells sooner at a more easily treatable stage. Loeb said this theory was accepted by just a minute percentage of the scientific community in 1974 and now probably about half. “Pharmaceutical companies want to believe there is a magic bullet for treating cancer,” he said. Rather than expecting a magic bullet, the realistic approach is to drastically slow down cancer progression to a point where a patient is more likely to die of something else, Loeb said. He said a cancer drug will kill some of the cells. but pre-existent mutant cells will start to multiply and spread. Loeb said he will feel vindicated for his mutator phenotype hypothesis once a study looks at DNA from human tumors and normal tissues from the same individual using this new technology. If everything goes well, this duplex-sequencing technology could generate meaningful clinical information from patients in two to three years, he said. The first application could be looking at recurrent tumors. How it evolved Duplex sequencing evolved in 2005 with the advent of “next generation sequencing,” which allowed researchers to sequence billions of nucleotides at a time. The most direct means for finding pre-existing cancer cells is to sequence cancer cells and identify cancer-associated mutations. But with the substantial error rate in the original technology, there were too many false-positive findings. Dr. Scott Kennedy, a researcher in the Loeb lab,had been testing a method to refine next generation sequencing called Safe-SeqS, published by Johns Hopkins University researchers last year. Schmitt started working on Safe-Seqs when he joined the lab as well. But they couldn’t get it to work. “We kept getting really screwy numbers and mutation types that didn’t make any sense,” said Kennedy. “We finally figured out it was due to damaged DNA.” Kennedy said he and Schmitt spent a significant amount of time discussing how to get around the problem, but didn’t have a solution. It was during a plane ride heading to an ice climbing trip last year that Schmitt and Salk came up with the duplex-sequencing approach. Back in the lab, Schmitt and Kennedy started running experiments sequencing M13 data – a virus that infects bacteria – because they had a good idea what the mutation rate should be. The implementation of a relatively simple concept was quite complicated and involved writing a number of custom programs and analyzing extremely large data sets. Salk, while vacationing in Thailand and doing medical rotations in Washington and Idaho, kept in contact with them through email and Skype. The results After only a matter of weeks, the UW researchers had their results. It worked. M13 has an established base substitution frequency of 3.0 x 10-6. Duplex sequencing measured a “nearly-identical” rate of 2.5 x 10-6, the researchers reported. “It’s pretty remarkable just how quickly it all came together,” said Salk. “In science, very few ideas work as planned. One thing that I think surprised us was that no one had developed this before. The basic premise of the idea was so simple that we spent the better part of a day trying to come up with reasons why to wouldn’t work, thinking that we must have missed something obvious.” For this new generation of scientists, all of whom are in their early 30s, coming up with an idea took leaving the lab. “I think the best ideas come when you are doing things you enjoy,” said Salk, whose grandfather, Dr. Jonas Salk, a medical researcher and virologist, discovered and developed the first polio vaccine in 1955. “For me, creativity does not happen in a cubicle.” Their work is described in a report on Genomeweb: “The duplex sequencing method uses 24-nucleotide-long tags split into two 12-nucleotide additions to each end of the DNA molecule, which the researchers calls a ‘duplex tag.’ The tags are incorporated into standard Illumina sequencing adapters. After tagging strands, the researchers PCR amplified them, yielding ‘families’ of molecules marked by a common tag. By grouping molecules that share a tag, the team could then compare sequences among them and eliminate any that didn’t show at least three duplicates with at least 90 percent sharing the same sequence. This step filtered out random errors introduced during PCR or sequencing.”; 个人分类: NGS|1821 次阅读|0 个评论

[转载]三款热门个人型测序仪比较: nooney1986 2011-9-14 15:24; 2009年，生物通抢先发布《新一代测序技术之三国时代》，聚焦了Illumina、罗氏和ABI的新一代测序仪。如今两年过去了，测序市场上的竞争愈加激烈，第二代测序仪的升级版频出，而第三代测序仪也加入了激战。各厂商的你追我赶，使得测序通量不断攀升，而测序成本也一降再降。然而，除了一些大实验室和基因组中心，更多的实验室只能是笑看风云。无奈，门槛太高，而且也没有足够的样品让仪器满负荷运行。为了让更多的实验室能够有机会使用新一代测序技术，各大公司也争相推出了个人型测序仪。在此，生物通小编将对这三款测序仪的性能参数进行比较，主要是通量、读长、质量和费用。我想，这四个参数也是大家在购买测序仪前最先考虑的。通量在这一参数上，MiSeq无疑具有最大优势。对于2×150 bp的末端配对运行，MiSeq的通量已超过1 Gb。GS Junior目前每次运行可得到超过35 Mb 高质量过滤后数据。Ion PGM的314芯片有着10 Mb的通量，而316芯片则提高了10倍，达到100 Mb以上。据Life Technologies介绍，第四季度上市的318芯片将带来1 Gb的通量。读长谈到读长，罗氏一定是二代测序仪制造商中笑得最开心的。GS Junior的平均读长达到400 bp。在通量上，GS Junior 虽稍逊一筹。然而，它的长读长为后续数据处理所带来的优势也是其他平台无法比拟的。以不久前的德国大肠杆菌事件为例，短读长平台多日未解的数据结果，GS Junior 数据5分钟内即给出了相应的生物学结论。根据最近公布的数据，Ion PGM测序仪的读长已经超过100 bp。据Ion Torrent的创始人Rothberg介绍，读长很快将超过200 bp，并达到400 bp。如果真是那样，将可以与GS Junior相媲美。准确性自Ion PGM 测序仪推出以来的6个月内，它的准确性已提高了一个数量级。Ion Torrent充分利用其简单测序试剂的固有准确性，实现了更多的信号，更好的信号处理以及改善的碱基检出。根据其发表在《Nature》上的论文，利用Ion PGM测序仪对大肠杆菌测序时，对于每次读取的前50个碱基，每个碱基的准确率超过99.6%，对于前100个碱基，大约在98.9%。在包含均聚物的部分基因组，准确率有所下降，但5碱基均聚物的准确率仍达到97.3%。而同样是对大肠杆菌测序，MiSeq平台也产生了高质量的数据，其中85%以上的碱基高于Q30，且有着均匀的GC覆盖。此结果与HiSeq平台的结果相似，表明MiSeq可很好地预测了高通量HiSeq 2000测序平台所带来的结果。费用想必大家都对这三款仪器的价格很感兴趣。但仪器售卖时涉及的因素较多，可能还需要配套仪器、试剂和耗材，因此生物通很难以一个准确的数字来概括每台仪器的价格。就国外公布的价格来说，Ion PGM是最低的，而MiSeq和GS Junior差不多。至于每次运行的费用，大家肯定也很想知道。下面是国外学者整理的单次运行的试剂费用，仅供大家参考。这三款个人型测序仪的价格比高端仪器要低得多，且有着快速的周转时间，因此适合快速、小规模的测序工作，比如对细菌的基因组进行测序或者对很多人基因组里的某一个基因区域进行测序，以帮助确诊疾病等。对于微生物等基因组的de novo测序，GS Junior无疑是最合适的，MiSeq和Ion PGM也适合，但装配可能比454更具挑战性。而对于定向位点的重测序，MiSeq的性价比最高，Ion PGM的数据则少于MiSeq。另外，个人型测序仪也在不断升级。Ion PGM的318芯片即将上市，如果真如它所预计的那样，通量达到1 Gb，读长达到400 bp，那市场恐怕又是另一番格局。当然，GS Junior和MiSeq也不会固步自封，也许不久之后的性能也会大大提升。; 个人分类: 分子诊断|2768 次阅读|0 个评论

[转载]BBSRC对新一代测序技术的评估: 热度 1 lry198010 2011-2-11 21:39; 附件是BBSRC对新一代测序技术的评估、相应的结论和在扶持新一代测序技术所给出的建议！总结比较全面，值得阅读; 个人分类: next-generation-sequence|2669 次阅读|1 个评论

我们需要什么样的参考序列: lry198010 2010-9-12 23:43; 随着测序技术的发展; 个人分类: next-generation-sequence|6 次阅读|0 个评论

[Paper Excerpt]Genome-wide analysis of allelic expression imbalance in human pri: maplesword 2010-8-28 23:58; Highlight: This is an easy-understanding paper, describing a detailed process for analyzing allele-specific expression with RNA-Seq data. It is very useful and it is possible for us to follow its steps to achieve the preliminary results. Outline: I. Expectation: a large fraction of the loci (SNPs) to have a regulatory role on gene expression via effects on transcription, message stability and splicing. II. Data 1. Samples: eight independently sequenced human poly(A)-selected transcriptomes obtained from primary cells from four healthy donors using high-throughput paired-end (PE) resequencing. For each of the four individuals there are two conditions: T-cell activation (stimulated) or unstimulated 2. Sequencing: Illumina GAII, 45 bp reads, most of which are paired end. 3. Mapping: Ensembl v52 CCDS was used as reference sequence set, with the additional one sequence per intron extending intron boundaries 40 bp on each side to allow mapping of reads overlapping exon-intron junctions. One sequence per non-standard exon-exon junction (up to three skipped exons) was also included. Reads were mapped using novoalign (www.novocraft.com) V2.05.12 PE mode for paired reads, and SE mode for SE data. Quality reads were defined as uniquely mapped reads with phred-scaled probability score =20. 4. Importance of transcript coverage: the ability to detect an allelic imbalance using ASE depends on two parameters: the strength of the allelic imbalance and the read depth at the reporter heterozygous SNP. The authors analytically computed the read depth required to demonstrate allelic imbalance for different allelic ratios. Based on the results, to remove SNPs providing almost no power to detect allelic imbalance, they only tested for ASE SNPs with read depth 50. 5. SNP calling: read depth =50, the frequency of the second most common SNP was at least 15%. 6. Quality filtering for heterozygous SNPs: to verify that the allelic call is independent of the position of the SNP within the 45 base reads, both distributions of positions (the two alleles) was compared using the Kolmogorov-Smirnov test; to check for strand-specific (forward/reverse) biases, a goodness-of-fit chisq test on the 2 by 2 table of allelic calls by strand was used. 7. Quantity: first tested 589673 dbSNPs (from Ensembl r52) for ASE and located in annotated spliced transcripts; also tested for ASE 4282208 intronic dbSNPs. Heterozygous SNPs with sufficient read depth were selected, and 4929 pairs of heterozygous SNPs/samples were able to test for allelic imbalance. Grouping these SNPs by transcripts for each of the samples provided 3107 pairs transcripts/samples with sufficient coverage for ASE analysis, with 1371 transcripts and 2701 SNPs. III. ASE test 1. Method: testing a single SNP for allelic imbalance uses a chisq goodness-of-fit test for even frequencies of both alleles. * When considering several SNPs located at the same genes at one time: first used novoPhase (http://www.gene.cimr.cam.ac.uk/todd/) to do phasing. With the known phase, the counts can be summed across heterozygous SNPs, couting only once reads overlapping multiple SNPs. 2. Results: tested a total of 4929 pairs of heterozygous SNPs/samples, and 370 SNPs showed eveidence of allelic imbalance at P 0.001, with FDR ~ 1%. 3. HapMap SNPs: 87796 dbSNPs in spliced transcripts passed HapMap quality filters. When restricting the analysis to this subset of SNPs, the ASE rate was significant lower (4.6%), showing a more reliable estimation. 4. Coverage: when restricting this analysis to HapMap SNPs with read depth 100, a higher ASE rate of 7.5% (66 of 878) was achieved. III. Validation to the ASE analysis 1. Genotyping data: the authors genotyped the four individuals using the Illumina Quad660W BeadChip. They lowered the minimum read depth required to call SNPs in RNA-Seq data to 20, and identified 9727 pairs of SNP/samples shared between RNA-seq and genotyping chip. In these 9727 pairs, 6885 calls are homozygous based on both genotyping chip and RNA-Seq, and only 1 call is homozygous based on chip but heterozygous based on RNA-seq. In the remaining 2841 pairs which got heterozygous calls based on chip, 14 are homozygous based on RNA-seq data. Four of these calls were located in transcript SNRPN which is a known parentally imprinted gene (monoallelic expression). Three are located in ERAP2 which is with known complete cis-acting differential allelic control. Four of these are of too low read depth. The other three calls may be real. 2. Single locus validation: to confirm that some of our findings are not the consequence of techinical biases; selected four pairs of HapMap SNPs/individuals and validated them using two different locus-specific assays: clone-based allele-specific expression (C-BASE) and PeakPicker. All four initial RNA-seq results replicated and PeakPicker/C-BASE provided consistent results. For C-BASE, an allelic bias was observed even using genomic DNA (technical bias), thus using the gDNA allelic distribution as control when using chisq test. IV. ASE analysis of disease-associated genes 1. Genes: the authors reviewed the literature and identified 79 genes previously assiciated with autoimmune disorders. 2. SNPs: includes all heterozugous SNPs even not listed in the Ensembl database but discovered using the previous method. 61 heterozygous SNPs with read depth 50 were found and 8 of them were not listed in dbSNP. These genes was located in 22 genes. 3. Counting separately each pair of SNP/individual: tested 127 pairs and 13 were imbalanced. 4. The analysis of these 13 pairs: first grouped them with the genes they were located at; did phasing to test whether the SNPs were consistent with each other. V. Possible biases 1. Sequencing chemistry biases: for PCR biases over-amplifying identical cDNA fragment, because identical mapping location for both 45 bp reads is unlikely to occur randomly, for each set of clonal read pairs the authors only included a single read pair in the analysis; for biases that are specific to forward/reverse read direction, the authors verified that for heterozygous SNPs the allelic ratio ditribution was consistent for read counts obtained for the reverse and forward strands (so they will make no difference). 2. In silico mapping biases: the biases towards the reference allele. The authors solved these by: replaced the reference allele with the corresponding genetic ambiguity code coding both possible alleles at this SNP; relaxed the threshold on the number of mismatches for allowing reads to be mapped in order to limit the impact of a small number of mismatched SNPs. These two additions corrected most of the reference allele bias. 3. The authors excluded heterozygous SNPs with 45 bp of a called indel in the same individual from the ASE analysis. (novoalign provides calls for small indels) 4. Other biases: 1) Low-complexity or repeat sequence surrounding heterozygous SNPs is associated with an elevated ASE positive rate. The authors defined low-complexity/reoeat sequences as more than 25% of the 90 bp surrounding sequence (45 bp on each side) was masked by RepeatMasker. 2) the ASE positive rate was lower for heterozygous SNPs that passed HapMap quality filters (mentioned before). 5. Validation of additional ASE results: used individual 1 data and selected 22 transcripts with a unique imbalanced heterozygous SNP (resequencing, PeakPicker). The data suggest that several biases, some of which unknown, increase the FDR. The false positive rate among non-HapMap SNPs is ~50%. Thus unproven quality SNPs should only be used for ASE estimation with great caution and the presence of multiple imbalanced SNPs is required to provide convincing evidence of ASE. VI. Other tests 1. PE vs. SE reads: PE is better because of higher coverage. 2. Sequence capture: to improve the read depth.; 个人分类: 未分类|4931 次阅读|0 个评论

[Paper Excerpt]Transcriptome genetics using second generation sequencing in a Ca: maplesword 2010-8-28 23:58; Outline: I. Data 1) Data source: the authors sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines from 60 CEU individuals using 37-bp paired-end Illumina sequencing. Each individual yielded 16.9+-5.9 million reads that mapped to the NCBI36 assembly of the human genome using MAQ. 86% of filtered reads mapped to known exons in Ensembl v54. 2) Data quantification: read counts for each individual were scaled to a theoretical yield of 10m reads and corrected for peak insert size across corresponding libraries. The authors developed a new method FluxCapacitor to map read into specific isoform. (Q: Can we use cufflink or scripture to do this?) 3) Quality evaluation: the authors campared whole-gene read counts to array intensities generated with Illumina HG-6 v2 microarrays. 4) Attempt to infer abundance values for exons that are not screened: with the same principle as using the correlation structure (LD) of genetic variants to impute variants from a reference to any population sample of interest. II. SNP-Expression Association 1) Association of gene expression measured by RNA-Seq with genetic variation: see reference , Population genomics of human gene expression. 2) Evaluation of association: through permutation (see reference ), in exons, transcripts, and genes. * example of permutation provided by Xie Gangcai: after doing mapping, disarrange the nucleotides of each read and do mapping again to see how many reads are mapped as background or control, and then use some test to get a p-value to see significance of the original mapping. 3) A problem of RNA-Seq: RNA-Seq exon eQTLs have lower representation in low abundance genes indicating that rare transcripts are not well quantified at this level of coverage. (consistent with the other paper). 4) Replicate the eQTL discoveries: compared associations between this study and those obtained from sequencing the transcriptomes of an African population. The authors assessed the P-value distribution of matching CEU associations given the top associated SNP for 500 genes from the African population. ~33% of these signals were shared (P0.0001 assessed by permutations). 5) Enrichment of eQTLs given an exon's location: eQTLs entiched around the TSS. The authors also identified increased number of discoveries for the first, second and last exon compared to any middle exons. When assessing the distribution of significant eQTLs around the 5' end of the exon of interest, the authors found that significant eQTLs when found associated with the last exon are closer to the last exon than any other exon. III. Quantification of allele-specific-expression (ASE) 1) Transcriptome sequencing allows the quantification of ASE. 2) SNP to assess ASE: An average of 4000 heterozygote confirmed HapMap3 SNP positions per individual. 3) The proportion where both SNP alleles were detected: the authors assessed it as a function of mapping quality using SAMtools. 72% of heterozygote sites have both alleles detectable at least once with MAQ mapping quality 10, and the number slightly decreases with increasing mapping quality. 41% of the heterozygotes have more than 6 reads. 4) ASE assession: first corrected for reference to non-reference differential mapping for each library because of a tendency for the reference allele to be overrepresented in pileups over a heterozygote. With this frequency as the success rate when assessing the binomial probability of allele-specific expression, the authors tested for ASE. 5) Relationship between known eQTLs and ASE: first phasing double heterozygotes for both eQTL and ASE. As coverage increasing, the correlation between eQTL significance and ASE ratio improves; and then reads were summed across individuals to assess the one-sided ASE binomial P-value distribution with respect to eQTL phasing. For 0.01 and 0.001 significant eQTLs, the tail of the ASE P-value distribution was enriched, while for the exons without eQTLs, both tails of this distribution were enriched. (Q1: How to do the phasing? Q2: What is the one-sided ASE p-value mean? What is the test testing for?) * Phasing: for heterozygotes, phasing is to find out the alleles located on each chromosome. Unphased data - Genotype; Phased data - Haplotypes. 6) Relationship between rare eQTLs: the authors selected SNPs heterozygous in six or more individuals in exons without evidence for an eQTL, and examined patterns of haplotype homozygosity between individuals that shared a significant ASE signal (at P0.05) with those that did not. * Haplotype homozygosity: the probability of selecting two identical haplotype at random from a population. IV. Genetic basic of alternative splicing 1) The authors performed association between known variants affecting splicing signals with their respective genes and exons; in total, 963 variants for 788 genes were tested. 2) Stratification: splice variants were stratified in donor and acceptor variants and tested against abundance of exons 5' and 3' to the intron where they are residing. They found that donor associates with 5' exon more than the 3' one, while acceptor associates with 3' exons more than 5' exons. 3) Further assumption: if genetic variants are effecting transcript-specific expression, the authors should be able to detect heterogeneity in the transcript distribution found between chromosomes within an individual. 4) Measurement of degree to which genetics influences transcript-specific expression: look at insert size distribution of paired end reads over each heterozygote. Their expectation is that the heterogeneity of inserts sizes over significant ASE heterozygotes between each of their alleles would be increased relative to that between alleles of non-significant ASE heterozygotes (if one haplotype is increasing the expression of a particular transcript relative to the other allele, the insert size distribution would be changed). The heterozygotes with a minimum of 50 reads for both allels were tested, which include 901 positions. For each heterozygote for an individual, a bootstrapped Kolmogorov-Smirnov test was run for the respective insert size distribution (Q: Why use bootstrapped KS test instead of ordinary KS test?), and then the p-values were separated given the heterozygote was significant for ASE or not. Of the 901 heterozygotes, 235 were significant for ASE and 105 had significant transcript distribution heterogeneity; this corresponded to 72 genes which contained an ASE significant heterozygote. 5) Effect of genetic variants on events contribute to alternative isoforms (e.g. inclusion/exclusion of exons): derived from the authors' method, FluxCapacitor. Of 6600 quantified events, 110 are significant at the 0.01 permutation threshold. V. Advanced Methods; 个人分类: 未分类|3900 次阅读|0 个评论

[转载]基于微流体的新一代测序技术: lry198010 2010-6-11 22:17; 最近新成立的GnuBio公司，计划在今年12月中旬推出一种基于微流体的测序仪。这种测序仪的基本原理是通过把扩增的大约1kb的DNA模板包裹在一个纳升大小的小液滴里，然后再给这些小液滴转换为用条形码编码了的皮升大小的小液滴里（每一个小微点的直接大概只有1微米，注意1毫升ml=1000微升，1微升ul=1000纳升，1纳升nl=1000皮升，1皮升pl=1000飞升fl），然后通过高速照相设备可以读出对小液滴的条形码和测序得到信息，他们目前这套告诉照相设备，在一分钟以内可以读取1百万个小液滴的序列。预计其可以30$就可以得到覆盖人类基因组30倍的序列，读出至少是1kb。这个测序系统最大的特点是，其测序通量可以根据需要进行任意调节，也就是说，可以一次完成1个基因组的测序，也可以只完成几百条序列的测序。此外，因其特有的测序原理（微滴的barcode和序列是分开的），这套系统可以轻易的对不同来源的DNA序列进行barcode，在原理上说，只要条形码的数目是无限的，那么这个系统能用的barcode的数目是无限的。这是相关的英文报道; 个人分类: next-generation-sequence|4029 次阅读|0 个评论

[转载]一种结合杂交信号和manopore进行测序的单分子测序法: lry198010 2010-6-7 23:25; 来自波士顿大学和哈佛大学的研究者发展出一种结合杂交信号和manopore进行测序的单分子测序法，他们首先构建一种称为分子信标的寡聚核苷酸探针，每一种信标用两种不同的荧光标记表示，然后对待测序的DNA序列进行处理，使每一种碱基都能与特定的信标结合，然后引导这个已处理的DNA通过基于纳米技术构建的namopore，namopore顺次解脱待测DNA序列上的信标，而但信标别从DNA分子上解脱下来时，信标上的荧光基团就会淬发出荧光，通过高能超快速摄像机就把这些荧光信号依次记录下来，就可以读出待测DNA的序列。据研究者称，他们这种方法可以在一个小时以内完成500Gbp的测序量，而预计读长将达到1000bp以上。研究者称他们这种方法与其他基于纳米技术的方法最大不一样的地方就是，测序样品的制备和最后序列的读取是分开的，也就是说从获得DNA样品到处理这些DNA样品，加上信标的处理与最后进行DNA序列的读取是分开的，因此在序列的读取过程中不需要加入任何测序试剂（其他的测序技术一般都需要在序列的读取过程中反复的加入和洗脱反应试剂），与传统Sanger测序法测序样品的准备和上机通过电泳进行序列读取的过程是分开的相似。这就大大简化了测序设备的设计和生产，以及设备的后续升级。这是在genomeweb上的评价文章：Optical Recognition of Converted DNA Nucleotides for Single-Molecule DNA Sequencing Using Nanopore Arrays; 个人分类: next-generation-sequence|3493 次阅读|0 个评论

基于石墨单分子层纳米孔的测序技术: lry198010 2010-6-3 00:33; 我们知道，组成DNA序列的4种碱基有着不同的带电性质，如果能直接一一测量DNA链上每一碱基的带电性质，哪也就可以知道了DNA的序列了。一个很好的办法就是，用一块很薄的材料，然后在上面开一个很小的孔，测量通过小孔的电流（这个电流是恒定的），然后再测量当一个一段DNA序列通过这个小孔时的电流变化情况，通过电流的变化就可以测出DNA的序列了。这种设想最大的困难是，如何制造出厚度只有一个或小于一个碱基大小的材料。纳米技术所能制造的材料的厚度一般是单个碱基厚度的几十倍，很难满足用于这种方式的单分子测序技术。而来自荷兰Kavli纳米科学研究所的Grégory Schneider及其合作伙伴提出使用石墨单分子层来构建“薄板“的方法，用这种方法所构建的薄板的厚度只有一个碱基的厚度。他们使用构建所构建的薄板及其在上面构建的小孔对DNA分子通过这个小孔时的电流变化进行了测量，测量的结果能准确反应是否有DNA链通过小孔，是单链还是双链。目前他们构建的石墨单分子薄板材料只能给出DNA链是什么时候开始进入小孔，什么时候完全通过小孔的，但可以预计，在不久的将来，他们应该可以开发出测量DNA链上单个碱基通过小孔时的电量变化情况，从而实现无合成反应的DNA测序。 DNA Translocation through Graphene Nanopores 对这个方法的评论（不好意思，忘记来源了）; 个人分类: next-generation-sequence|6072 次阅读|2 个评论

re-sequence需要多高的覆盖度: lry198010 2010-5-6 18:55; 今天听报告，里面有一个内容就是通过re-sequence的方法来找到相关性状的候选基因。连续到这一段时间以来，发现有很多关于重测序利用的报道，不由的想到了这个问题，那就是：要达到我们的目的，re-sequence需要达到多大的覆盖度？初步思考，认为与下面的内容相关：（1）物种基因组结构的复杂程度，比如说像油菜这一的异源多倍体，其由A基因组和C基因组组成的，而A、C基因组的共同祖先最早可以追溯到3~5M年以前，并且这个共同的祖先在~10M年发生了三倍化的过程，所以如果想区分不同同源拷贝来源的基因，可能就需要相对比较高的覆盖度，并且序列的读长也要达到一定的要求。比如说，如果一个read的长度只有35bp，那么对很多油菜基因来说，re-sequence的数据通过mapping的方法就很判定不同同源拷贝之间的序列差异（A，C对应两个同源拷贝基因在外显子上的序列相似性平均在97%左右。paired-end测序也非常有必要。还有pacific公司提出的meta-paired方法。（2）测序的长度和方法。比如测序读长长的需要的覆盖度就要低一些，在同等覆盖度条件下，paired-end测序所能达到的效果就要好于flagment测序方法。（3）重测序的目的。如果只是想从re-sequence的数据，构建一个高密度的SNP图谱，那么根据一篇水稻的文章（Genome Res，2009，韩斌等人发表的），是乎只要有0.01~0.1倍的覆盖度就可以了。如果只想找到一定数量的SNP，则与两个待分析样品的基因组差异程度有关。如果想做基因组结构差异分析，则目前来看，就需要不同插入长度的paired-end测序，插入长度与想要发现的插入/缺失、倒位~异位的长度有关。但是有没有这样的一个公式，根据re-sequence的目的（p）、测序的方法（m）、读长（l）和基因组的结构（s）来估计需要的覆盖度(c)的公式(f)！c=f（p,m,l,s); 个人分类: next-generation-sequence|7362 次阅读|0 个评论

更多...

帐号		自动登录	找回密码
密码			注册

关闭 安全验证

标签: ngs

相关帖子

相关日志

关闭安全验证