目前基于新一代测序技术(Next Generation Sequencing, NGS)在临床充分利用的产前筛查方法(或者叫唐氏筛查)的原理最早在PNAS的两篇文章,通在2008年上发表,相关的数据处理策略略有差别,两篇文章分别 是 “Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma” 和“Noninvasive diagnosis of fetal aneuploidy by shotgunsequencing DNA from maternal blood“,两个团队基于各自的技术手段,先后都成立了提供唐氏筛查的技术服务公司。 文章 Noninvasive prenatal diagnosis of fetal chromosomal aneuploidy by massively parallel genomic sequencing of DNA in maternal plasma中的数据处理原理如下: 第一步: Fetal DNA (thick red fragments) circulates inmaternal plasma as a minor population among a high background of maternal DNA (black fragments). A sample containing a representative profile of DNA molecules in maternal plasma is obtained. 在母体血液中有游离的胎儿DNA片段,在上述途中,红色的片段代表胎儿DNA片段。从图上可以看出,母体血液中胎儿DNA的浓度非常低,为此科研工作者的目的在于用数据处理的方法能够把信息放大到准确检测的地步。 第二步: In this study, one end of eachplasma DNA molecule was sequenced for 36 bp using the Solexa sequencing-by-synthesis approach. 当时用Solexa测序技术测取36bp的reads,目前主要基于Illumina、CG和Proton的测序技术,读长和深度都有所增加。 第三步: The chromosomal origin of each 36-bp sequence was identified through mapping to the human reference genome by bioinformatics analysis.将测取到的reads进行mapping操作,比对到人的参考序列上。 第四步: The number of unique sequences mapped to each chromosome was counted and then expressed as a percentage of all unique sequences generated for the sample, termed % chrN for chromosome N. 统计每条染色体上uniq比对的reads数目,同时计算每条染色体上uniq reads的比例。 第五步 :Z-scores for each chromosome and each test sample were calculated using the formula shown. 计算一个z-score值,利用染色体上uniq reads的比例减去control样本中对应此条染色体uniq reads比例的平均值,再除以control样本中此条染色体uniq reads的标准差。例如,21三体筛查, 假设有100个已知怀正常胎儿的样本,通过前4步的计算,可以得到每个样本在21号染色体上,uniq reads的比例具体值,如此可以计算这100个样本在21号染色体的平均值和标准差。如果1个新的样本要判断是不是21三体,可以根据上述公式去计算21号染色体的z-score,超过界定界限则是患病胎儿,反之正常。 其中就可以看到,正常样本数越多,平均值和标准差就越准(基线越来越标准),如此得到的z-score更准确。 通过以上方法最终的结果如下: 从上图可以看出,如果利用染色体上的uniq reads比例不能很好区分正常胎儿和3体胎儿,利用z-score的计算,可以发现3体胎儿的z-score超过3,正常胎儿都位于3以上,为此达到很理想的检测目的。 文章 Noninvasive diagnosis of fetal aneuploidy by shotgunsequencing DNA from maternal blood 中的数据处理原理和结果如下: 第一步:We obtained on average 10 million 25-bp sequence tags per sample. 测序获取数据; 第二步:将每条染色体按50kb划分bin,根据bin中uniq reads的数目进行排序,取中位数代表此条染色体的reads数目; 第三步:每个样本有21条染色体的reads数目,再对21个数据排序,取中位数代表此样本的reads数目; 第四步:用每条染色体的reads数目除以每个样本的reads数目得到归一化的一个值; 第五部:利用第四步得到的诡异化值去计算置信区间,当样本的值不在此置信区间时,被判定为异常样本。 利用此处理策略,可发现针对21三体,能够较理想区分正常胎儿和异常胎儿样本。 此处理策略也依赖于正常样本的数目,正常样本越多,计算得到的置信区间会越准确,判断的结果也就会越准。 参考文章: http://www.pnas.org/content/early/2008/10/03/0808319105 http://www.pnas.org/content/105/51/20458.abstract
HOMER Software for motif discovery and next-gen sequencing analysis Mapping reads to the genome Once you have checked your FASTQ files and have removed all adapter sequences that might be present, you are ready to map them to a reference genome. While tools like BLAST and BLAT are powerful methods, they are not specialized for the vast amount of data generated by next-generation sequencers. It is highly recommended that you use a next-gen specific read alignment program. Note: While BWA, Bowtie, and Tophat have received the most attention as short read alignment algorithms, new methods such as STAR are significantly faster and in some cases more accurate. More like it are likely to come along soon (if not already available...) Selecting a reference genome Both the organism and the exact version (i.e. hg18, hg19) are very important when mapping sequencing reads. Reads mapped to one version are NOT interchangeable with reads mapped to a different version. I would follow this recommendation list when choosing a genome (Obviously try to match species or sub species when selecting a genome): Do you have a favorite genome in the lab that already has a bunch of experiments mapped to it? Use that one. Do any of your collaborators have a favorite genome/version? Use the latest stable release - I would recommend using genomes curated at UCSC so that you can easily visualize your data later using the UCSC Genome Browser . (i.e. mm9, hg19) Mapping to a normal genomic reference You want to map your reads directly to the genome if you are performing: ChIP-Seq GRO-Seq DNase-Seq MNase-Seq Hi-C Popular short read aligners Full List Most Popular: bowtie : fast, works well bowtie2 : fast, can perform local alignments too BWA - Fast, allows indels, commonly used for genome/exome resequencing Subread - Very fast, (also does splice alignment) STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory) To be honest, I would probably recommend STAR for almost any application at this point if you have the memory ( see below ) Example of alignment with bowtie2: Step 1 - Build Index (takes a while, but only do this once): After installing bowtie2 , the reference genome must first be indexed so that reads may be quickly aligned. You can download pre-made indices from the bowtie website (check for those here first). Please be aware that bowtie2 indexes are different than bowtie indexes. To perform make your own from FASTA files, do the following: Download FASTA files for the unmasked genome of interest if you haven't already (i.e. from UCSC ). Do NOT use masked sequences. I also tend to remove the *_random.fa chromosomes. These often contain part of the canonical chromosomes in addition to regions that cannot be placed in the assembly. The problem with these regions is that the part shared with the canonical chromosome will be present twice, making it difficult to map the reads to a unique location. Concatenate FASTA files into a single file. We can do this using the UNIX cat command, which merges files together cat *.fa genome.fa From the directory containing the genome.fa file, run the bowtie2-build command. The default options usually work well for most genomes. For example, for hg19: /path-to-bowtie-programs/bowtie2-build genome.fa hg19 This command will create 6 files with a *.bt2 file extension. These will then be used by bowtie2 or newer versions of tophat to map data. Copy the *.bt2 files to the bowtie2 indexes directory (or somewhere you can store them) so that bowtie2 knows where to find them later: cp *.bt2 /path-to-bowtie-programs/indexes/ Step 2 - Align sequences with bowtie (perform for each experiment): The most common output format for high-throughput sequencing is FASTQ format , which contains information about the sequence (A,C,G,Ts) and quality information which describes how certain the sequencer is of the base calls that were made. In the case of Illumina sequencing, the output is usually a s_1_sequence.txt file. More recently the Illumina pipeline will output a file that is debarcoded with your sample name such as Experiment1.fastq. In addition, much of the data available in the SRA , the primary archive of high-throughput sequencing data, is in this format. To use bowtie2 to map this data, run the following command: /path-to-bowtie-programs/bowtie2 -p # cpu -x genome index prefix fastq file output filename i.e. /programs/bowtie2 -p 8 -x hg19 Experiment1.fastq Experiment1.sam Where genome index prefix is the common prefix for the *.bt2 files that were created using the bowtie2-build command in step 1, or from a downloaded index. If the *.bt2 files are stored int the /path-to-bowtie2-program/indexes/ directory, you only need to specify the name of the index. If the index files are located elsewhere, you can specify the full path names of the index files (in the examples above this would be -x /programs/indexes/hg19). In the example above, we use 8 processors/threads. The bowtie2 program is very parallel in nature, with near linear speed up with additional processors. The default output is a SAM file. To learn more about SAM alignment files, go to the next section on SAM/BAM files. There are many other options for bowtie2 that may be important for your study, but most ChIP-Seq data can be mapped using the default options. NOTE : Usually, the process of removing duplicate reads or removing non-unique alignments is handled by the downstream analysis program. Example of alignment with bwa: Step 1 - Build Index (takes a while, but only do this once): bwa index -a bwtsw genome.fa Step 2 - Align reads to the index (perform for each experiment): # where the genome.fa is in the same directory with your index from the first step bwa mem -t #cpus genome.fa reads.fq aln-se.sa #paired end bwa mem -t #cpus genome.fa reads1.fq reads2.fq aln-pe.sam Mapping to a genome while allowing splicing Usually, any kind of RNA-seq method will benefit from looking for splicing junctions in addition to genomic mapping: RNA-Seq (polyA+, total) CLIP-Seq/PAR-CLIP RIP-Seq ChIRP-Seq Ribo-Seq Popular splice read aligners Tophat (most popular) Subread - Very fast, (also does splice alignment) STAR - Extremely fast (also does splice alignment, requires at least 30 Gb memory) lots of others... I would probably recommend STAR for RNA-Seq is you have enough RAM ( see below ) Example of aligning RNA-Seq data with STAR (very very fast) STAR is one of a growing number of short read aligners that takes advantage of advances in computational power to optimize the short read mapping process (original publication: Dobin et al. ) The key limitation with STAR is computer RAM - STAR requires at least 30 Gb to align to the human or mouse genomes. To install STAR, visit the website and follow their instructions. Step 1 - Build a genome index Like all aligners, you need to build the genome index first. The STAR website has links to the hg19 genome index if you want to skip this step. First, make a directory for the index (i.e. mm9-starIndex/). Then copy the genome FASTA file it the directory and cd into it to make that directory your current directory. Then, to build the index, the command is: STAR --runMode genomeGenerate --runThreadN # cpus --genomeDir genome output directory --genomeFastaFiles input Genome FASTA file i.e. STAR --runMode genomeGenerate --runThreadN 24 --genomeDir ./ --genomeFastaFiles mm9.fa Note: For small genomes, you may need to add the following to create a proper index: --genomeSAindexNbases 6 Step 2 - Align RNA-Seq Reads to the genome with STAR To align RNA-Seq reads with STAR, run the following command: STAR --genomeDir Directory with the Genome Index --runThreadN # cpus --readFilesIn FASTQ file --outFileNamePrefix OutputPrefix i.e. STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn Experiment1.fastq --outFileNamePrefix Experiment1Star # paried-end data: STAR --genomeDir mm9-starIndex/ --runThreadN 24 --readFilesIn read1.fastq read2.fastq --outFileNamePrefix Experiment1Star STAR will create several output files - the most important of which is the *.Aligned.out.sam - you will likely want to convert this to a bam file and sort it to use it with other programs. The default output is a SAM file. To learn more about SAM alignment files, go to the next section on SAM/BAM files. Notes on STAR STAR is very very fast - it will rip through 20 million reads in a matter of minutes if you have 20 cpus working for you. In fact, the longest part of the program is loading the index into memory. If you are aligning several experiments in a row, add the option --genomeLoad LoadAndKeep and STAR will load the genome index into shared memory so that it can use it for the next time you run the program. Also, converting the sam file into a sorted bam file will take much longer than aligning the data in the first place. Example of Alignment with Tophat (not recommended) Tophat is basically a specialized wrapper for bowtie2 - it manipulates your reads and aligns them with bowtie2 in order to identify novel splice junctions. It can also use given splice junctions/gene models to map your data across known splice junctions. Step 1 - Build Index (takes a while, but only do this once): This part is exactly the same as for bowtie2 - if you already made or downloaded an index for bowtie2, you can skip this step. Step 2 - Align your RNA-seq data to the genome using Tophat To use tophat to map this data, run the following command: /path-to-bowtie-programs/tophat -o output directory -p # cpu /path-to-genome-index/prefix fastq file For example: /programs/tophat -o TophatOutput/ -p 8 /programs/indexes/hg19 Experiment1.fastq Paired-end Example: /programs/tophat -o TophatOutputPE/ -p 8 /programs/indexes/hg19 Experiment1.r1.fastq Experiment1.r2.fastq In the example above, we use 8 processors/threads. The tophat2 program contains a mix of serial and parallel routines, so more processors doesn't necessarily mean it will be finished much faster. In particular, if you are performing coverage based searchers, it may take a long time (multiple processors will not help). In general, if you have multiple samples to map, it's better to map them at the same time with separate commands rather than mapping them one at a time with mapping processors. Tophat places several output files in an output directory. The most important is the accepted_hits.bam file - this is the alignment file that will be used for future analysis (more info here ). There are additional files that can be useful, including a junctions.bed file with records all of the splice junctions discovered in the data. Important Tophat Parameters: --library-type fr-unstranded | fr-firststrand | fr-secondstrand Describes which method was used to generate your RNA-seq library. If you used a method that doesn't produce strand specific signal (i.e. just sequencing a cDNA library), then select the fr-unstranded. If you use a stranded method that sequences the first DNA strand made (like a dUTP method), then use fr-firststrnad. If you use direct ligation methods, then fr-secondstrand. Correctly specifying the library type will help Tophat identify splice junctions. -G GTF file Use this option to specify a known transcriptome to map the reads against. By default, tophat will also search for de novo splice events, but this will help it known were the known ones are so that you don't miss them. A GTF files are called Gene Transfer Files, and a description of the format can be found here . To get a GTF file for your organism, you can usually get one from UCSC Table Browser: In the output format, be sure to select GTF file - the file you download from here should work with tophat. Tophat Mapping Strategy If your goal is to identify novel transcripts and you have several separate experiments, I would recommend pooling all of your data together into a single expeirment/FASTQ file and mapping your data in one run. One of the ways Tophat tries to identify novel junctions is by first identifying exons by mapping segments of reads to the genome using bowtie2. The more segments it's able to map, the more confident it is about putative exons and the greater the chance it will identify unannotated splice sites. You can then go back with your novel splice sites and remap the original experiments (not pooled) to get reads for each individual experiment using the -j raw junction file . This is most useful for short reads. This general strategy is also useful if running cufflinks... Mapping bisulfite-treated DNA MethylC-Seq, BS-Seq, or RBBS-Seq data requires a special mapping strategy. In these experiments the genomic DNA is bisulfite treated, causing all unmethylated cytosines to be converted to uracil, which will utimately show up as thymine after sequencing. This is a clever way to figure out which cytosines are methylated in the genome, but requires a clever mapping strategy to avoid bias detection. I'll try to include an example of mapping this type of data in the near future, for now consider this list of BS-aligners . De novo Assembly Sometimes it makes more sense to perform de novo assembly instead of mapping reads to a reference genome. Assembly is well beyond the scope of this tutorial. Genomics assembly from short reads: Velvet , SOAPdenovo Transcript assembly: Trinity Can't figure something out? Questions, comments, concerns, or other feedback: cbenner@salk.edu http://homer.salk.edu/homer/basicTutorial/mapping.html
QIAGEN unveils initiative to create next-generation sequencing portfolio for use in clinical research and molecular diagnostics http://www.qiagen.com/about/pressreleases/pressreleaseview.aspx?PressReleaseID=384 Aim to expand next-generation sequencing beyond current focus on life sciences research QIAGEN plans to offer sample-to-result workflows that integrate its sample preparation and assay products with a next-generation benchtop sequencer and new bioinformatics Initiative combines broad range of QIAGEN products with acquisition of sequencing specialist Intelligent Bio-Systems, Inc. and a new strategic collaboration with SAP AG HILDEN, Germany, and GERMANTOWN, Maryland, USA, June 25, 2012 - QIAGEN N.V. (NASDAQ: QGEN; Frankfurt Prime Standard: QIA) today unveiled an advanced initiative to enter the field of next-generation sequencing (NGS) that aims to establish these technologies as routine processes used in new areas such as clinical research and molecular diagnostics . QIAGEN is in the advanced stages of creating sample-to-result, efficient and cost-effective NGS workflow solutions. These will combine a broad range of QIAGEN products - including automated sample preparation solutions ( nucleic acid extraction, DNA enrichment, library preparation and targeted gene analysis panels) - with a previously undisclosed next-generation benchtop sequencer in development with Intelligent Bio-Systems, Inc., a privately held company that QIAGEN has acquired. New bioinformatics, including solutions emerging from a new collaboration with SAP AG, will be incorporated into the workflows. A first sample-to-result NGS solution is expected to be launched next year, while details on specifications and launch plans are set to be released in early 2013. "The rapid advances in next-generation sequencing have enabled life science researchers to unlock many secrets about the molecular building blocks of life. Our ambition is to create a new dimension of benefits for these technologies by offering workflow solutions for clinical use, particularly to develop new medicines and improve healthcare with advanced diagnostics," said Peer M. Schatz, CEO of QIAGEN N.V. "While next-generation sequencing is viewed today mainly as a research tool, our initiative is to expand beyond this and to offer applications designed to address the needs of customers in clinical research and molecular diagnostics ." Key elements of this initiative include: Content: The development of a broad portfolio of gene panels designed for NGS analysis based on QIAGEN's extensive offering of molecular content, including GeneGlobe (www.geneglobe.com), an online portal that offers access to more than 60,000 well-defined and characterized molecular assays. In the first wave, QIAGEN plans to offer eight preconfigured gene panels for use in cancer, as well as enable customers to create customized panels for specific molecular pathways and diseases. Sample technologies: An extensive range of NGS sample preparation products is planned to be launched that are based on QIAGEN's global leadership position in sample technologies and enzymology. NGS module: A previously undisclosed NGS benchtop sequencer is in late stages of development with Intelligent Bio-Systems, Inc. (IBS), a privately held U.S. company that QIAGEN has acquired. This novel system can process multiple samples in parallel with highest flexibility and performance, and benefits from the use of proprietary sequencing by synthesis (SBS) technology exclusively licensed from Columbia University. Building on elements of previous IBS designs as well as on QIAGEN technologies, this new system - which is expected to enter beta testing with customers in 2012 - seeks to offer a new dimension of benefits and cost savings. Key features include new sample technologies and software as well as the ability to process up to 20 individual samples in parallel without a need for pooling and bar-coding, which can result in significant time and cost savings in clinical sequencing . Its design allows for flow cells and reagents to be loaded continuously while in operation, and also for up to 20 different assay types to be processed simultaneously and in random order. Many of these features - particularly the parallel processing of multiple samples and continuous loading of reagents and random samples - are considered essential for clinical sequencing . Two automation alternatives are being developed in combination with QIAGEN platforms to create workflow solutions from biological sample through to final result. One workflow integrates the NGS module into the QIAsymphony automation family, while a second is based on the QIAcube automated sample preparation system. Both workflows will offer extensive bioinformatics, including from the SAP collaboration. Financial terms of the IBS acquisition, which was completed during 2012, were not disclosed. "We are very excited to join forces with QIAGEN, which is an ideal partner to bring our new ultra-low cost sequencing technologies to the market as part of a complete workflow that will expand our product's use into new areas," said Steven J. Gordon, Ph.D., CEO and founder of Intelligent Bio-Systems. "We can deliver greater value to our customers from our novel technologies by leveraging QIAGEN's leadership in sample preparation, advanced gene panels and global reach along with the bioinformatics expected to emerge from the collaboration with SAP. Our goal is to better address the demands of clinical and core lab customers for complete solutions, since many have been struggling to adapt existing sequencing platforms to their workflows." Bioinformatics: SAP and QIAGEN are collaborating on bioinformatics efforts aimed at significantly reducing the time required for the analysis of sequencing data. The basis for the collaboration will be to apply the breakthrough SAP HANA ® platform in next-generation sequencing interpretation. Reducing this time period is seen as an important factor in driving greater use of sequencing technologies in new areas and reducing overall operating costs. "Rapid and accurate sequencing , assembly and interpretation of genomes represents a great challenge of our times in healthcare," said Dr. Vishal Sikka, member, Executive Board, SAP AG. "SAP HANA brings a dramatic acceleration to the data analysis challenges at the heart of genomics . We at SAP are very excited to collaborate with QIAGEN on this extraordinary opportunity to transform the biological sciences and help improve people's lives." QIAGEN intends to offer this new product portfolio across all of its customer classes, with priority focus on clinical research in Academia and Pharma as well as in some Molecular Diagnostics franchises, including select areas of Personalized Healthcare. The next-generation sequencing market, which up to this point has been driven primarily by life sciences research, is estimated to be more than $1 billion a year and growing rapidly as the use of these technologies expands into new areas. QIAGEN is a pioneer in enabling the use of biomarker data to guide treatment decisions through companion diagnostics based on real-time PCR technologies as well as a portfolio of sequencing -based diagnostic assays based on its Pyrosequencing ® technology. Together with its portfolio of next-generation sequencing technologies, QIAGEN can offer customers workflows that include the widest range of sample technologies to collect and process nucleic acid samples as well as leading assay and data analysis technologies for use across its customer classes in Academia, Pharma, Applied Testing and Molecular Diagnostics . The adoption of next-generation sequencing in clinical research and molecular diagnostics is still hampered by workflow challenges, particularly time required for data analysis as well as regulatory uncertainties and sequencing costs. The various features in this initiative seek to address these challenges, and could lead to NGS technologies being adopted in certain areas such as exploratory diagnostics, the diagnosis of complex diseases and treatment of cancer patients. NGS technologies are also expected to complement established routine molecular technologies such as real-time PCR . QIAGEN currently anticipates the investments planned to create this new NGS portfolio to be dilutive to adjusted EPS (earnings per share) by approximately $0.01 for full-year results in 2012 and by approximately $0.02 in 2013, but to be accretive to full-year results in 2014. About QIAGEN: QIAGEN N.V., a Netherlands holding company, is the leading global provider of Sample Assay Technologies that are used to transform biological materials into valuable molecular information. Sample technologies are used to isolate and process DNA , RNA and proteins from biological samples such as blood or tissue. Assay technologies are then used to make these isolated biomolecules visible and ready for interpretation. QIAGEN markets more than 500 products around the world, selling both consumable kits and automation systems to customers through four customer classes: Molecular Diagnostics (human healthcare), Applied Testing (forensics, veterinary testing and food safety), Pharma (pharmaceutical and biotechnology companies) and Academia (life sciences research). As of March 31, 2012, QIAGEN employed approximately 3,900 people in more than 35 locations worldwide. Further information can be found at http://www.qiagen.com/ . About Intelligent Bio-Systems, Inc.: Intelligent Bio-Systems, Inc. was founded in 2005 by Dr. Steven Gordon and Dr. Jingyue Ju to commercialize advanced sequencing technologies and patents from Columbia University. Next-generation sequencing refers to methods of analyzing DNA that began to emerge in the 1990s to improve upon earlier sequencing approaches that were time-consuming and laborious. Intelligent Bio-Systems' newer generation of sequencing by synthesis (SBS) technology incorporates proprietary advances in DNA sequence readout and DNA sample preparation, enabling low-cost, high throughput sequencing with high-quality data, making it attractive for customers.
September 28, 2012 Duplex-sequencing method could lead to better cancer detection and treatment By Bobbi Nodell Posted under: Health and Medicine , News Releases , Research , Science , Technology Michael Schmitt and Jesse Salk came up with the idea for the duplex-sequencing method while on a flight to an ice climbing expedition. During an ice climbing trip to the Canadian Rockies last Christmas, two young researchers from the UW, Michael Schmitt and Jesse Salk, talked about a simple but powerful idea to get better results when looking at cancer cells. If they could reduce the error rate in DNA sequencing, then researchers could better pinpoint which cells are mutating. This improvement could lead to early diagnosis of cancer and a better treatment plan once researchers knew which cells were resistant to chemotherapy. The idea was to sequence both strands of DNA. If they saw a mutation in one strand but not the other, they would recognize it as error from sequencing and not a true mutation. Their results, published online in the Proceedings of the National Academy of Sciences in August, are getting wide accolades. “If its power is confirmed, duplex sequencing will likely improve our understanding of the clonal substructure of human cancers, modify the catalog of rare mutations, help to pinpoint mechanisms of mutation generation, and potentially identify mutator phenotypes,” wrote Christopher Klein, with the Experimental Medicine and Therapy Research group at the University of Regensburg, in an accompanying editorial. “Eventually, it may open doors to clinical applications in which diagnostic accuracy is the sine qua non for ethical treatment decisions. Mary Levin Some of the members of the Loeb lab who worked on the duplex sequencing: Dr. Lawrence Loeb, Dr. Scott Kennedy, Dr. Michael Schmitt, and Dr. Jesse Salk. Working in Dr. Lawrence Loeb’s cancer research laboratory at the UW, the scientists demonstrated an error rate of less than one mistake per half a million nucleotides sequenced and a theoretical limit of less than one error per billion nucleotides. In contrast, the standard method yielded one error for every 200 nucleotides. “Based on our estimates, we appear to have improved the accuracy of sequencing by 10 million-fold or more,” said Dr. Michael Schmitt, the lead author of the paper, “Detection of ultra-rare mutations by next-generation sequencing.” Schmitt, a postdoctoral fellow in the Loeb lab, added, “This theoretically makes it possible to sequence the entire genome of a cell without a single error.” The genome of a cell is 6 billion nucleotides. The UW researchers are eager to use this approach to test Loeb’s mutator phenotype hypothesis proposed in 1974. The idea, considered radical at the time, surmises that the mutation rate in the early stages of tumor development is greater than that of normal cells. This could explain how some cancers spread and become resistant to chemotherapies so quickly. If researchers could see which cells are mutating faster, then they could potentially detect cancer cells sooner at a more easily treatable stage. Loeb said this theory was accepted by just a minute percentage of the scientific community in 1974 and now probably about half. “Pharmaceutical companies want to believe there is a magic bullet for treating cancer,” he said. Rather than expecting a magic bullet, the realistic approach is to drastically slow down cancer progression to a point where a patient is more likely to die of something else, Loeb said. He said a cancer drug will kill some of the cells. but pre-existent mutant cells will start to multiply and spread. Loeb said he will feel vindicated for his mutator phenotype hypothesis once a study looks at DNA from human tumors and normal tissues from the same individual using this new technology. If everything goes well, this duplex-sequencing technology could generate meaningful clinical information from patients in two to three years, he said. The first application could be looking at recurrent tumors. How it evolved Duplex sequencing evolved in 2005 with the advent of “next generation sequencing,” which allowed researchers to sequence billions of nucleotides at a time. The most direct means for finding pre-existing cancer cells is to sequence cancer cells and identify cancer-associated mutations. But with the substantial error rate in the original technology, there were too many false-positive findings. Dr. Scott Kennedy, a researcher in the Loeb lab,had been testing a method to refine next generation sequencing called Safe-SeqS, published by Johns Hopkins University researchers last year. Schmitt started working on Safe-Seqs when he joined the lab as well. But they couldn’t get it to work. “We kept getting really screwy numbers and mutation types that didn’t make any sense,” said Kennedy. “We finally figured out it was due to damaged DNA.” Kennedy said he and Schmitt spent a significant amount of time discussing how to get around the problem, but didn’t have a solution. It was during a plane ride heading to an ice climbing trip last year that Schmitt and Salk came up with the duplex-sequencing approach. Back in the lab, Schmitt and Kennedy started running experiments sequencing M13 data – a virus that infects bacteria – because they had a good idea what the mutation rate should be. The implementation of a relatively simple concept was quite complicated and involved writing a number of custom programs and analyzing extremely large data sets. Salk, while vacationing in Thailand and doing medical rotations in Washington and Idaho, kept in contact with them through email and Skype. The results After only a matter of weeks, the UW researchers had their results. It worked. M13 has an established base substitution frequency of 3.0 x 10-6. Duplex sequencing measured a “nearly-identical” rate of 2.5 x 10-6, the researchers reported. “It’s pretty remarkable just how quickly it all came together,” said Salk. “In science, very few ideas work as planned. One thing that I think surprised us was that no one had developed this before. The basic premise of the idea was so simple that we spent the better part of a day trying to come up with reasons why to wouldn’t work, thinking that we must have missed something obvious.” For this new generation of scientists, all of whom are in their early 30s, coming up with an idea took leaving the lab. “I think the best ideas come when you are doing things you enjoy,” said Salk, whose grandfather, Dr. Jonas Salk, a medical researcher and virologist, discovered and developed the first polio vaccine in 1955. “For me, creativity does not happen in a cubicle.” Their work is described in a report on Genomeweb: “The duplex sequencing method uses 24-nucleotide-long tags split into two 12-nucleotide additions to each end of the DNA molecule, which the researchers calls a ‘duplex tag.’ The tags are incorporated into standard Illumina sequencing adapters. After tagging strands, the researchers PCR amplified them, yielding ‘families’ of molecules marked by a common tag. By grouping molecules that share a tag, the team could then compare sequences among them and eliminate any that didn’t show at least three duplicates with at least 90 percent sharing the same sequence. This step filtered out random errors introduced during PCR or sequencing.”
Highlight: This is an easy-understanding paper, describing a detailed process for analyzing allele-specific expression with RNA-Seq data. It is very useful and it is possible for us to follow its steps to achieve the preliminary results. Outline: I. Expectation: a large fraction of the loci (SNPs) to have a regulatory role on gene expression via effects on transcription, message stability and splicing. II. Data 1. Samples: eight independently sequenced human poly(A)-selected transcriptomes obtained from primary cells from four healthy donors using high-throughput paired-end (PE) resequencing. For each of the four individuals there are two conditions: T-cell activation (stimulated) or unstimulated 2. Sequencing: Illumina GAII, 45 bp reads, most of which are paired end. 3. Mapping: Ensembl v52 CCDS was used as reference sequence set, with the additional one sequence per intron extending intron boundaries 40 bp on each side to allow mapping of reads overlapping exon-intron junctions. One sequence per non-standard exon-exon junction (up to three skipped exons) was also included. Reads were mapped using novoalign (www.novocraft.com) V2.05.12 PE mode for paired reads, and SE mode for SE data. Quality reads were defined as uniquely mapped reads with phred-scaled probability score =20. 4. Importance of transcript coverage: the ability to detect an allelic imbalance using ASE depends on two parameters: the strength of the allelic imbalance and the read depth at the reporter heterozygous SNP. The authors analytically computed the read depth required to demonstrate allelic imbalance for different allelic ratios. Based on the results, to remove SNPs providing almost no power to detect allelic imbalance, they only tested for ASE SNPs with read depth 50. 5. SNP calling: read depth =50, the frequency of the second most common SNP was at least 15%. 6. Quality filtering for heterozygous SNPs: to verify that the allelic call is independent of the position of the SNP within the 45 base reads, both distributions of positions (the two alleles) was compared using the Kolmogorov-Smirnov test; to check for strand-specific (forward/reverse) biases, a goodness-of-fit chisq test on the 2 by 2 table of allelic calls by strand was used. 7. Quantity: first tested 589673 dbSNPs (from Ensembl r52) for ASE and located in annotated spliced transcripts; also tested for ASE 4282208 intronic dbSNPs. Heterozygous SNPs with sufficient read depth were selected, and 4929 pairs of heterozygous SNPs/samples were able to test for allelic imbalance. Grouping these SNPs by transcripts for each of the samples provided 3107 pairs transcripts/samples with sufficient coverage for ASE analysis, with 1371 transcripts and 2701 SNPs. III. ASE test 1. Method: testing a single SNP for allelic imbalance uses a chisq goodness-of-fit test for even frequencies of both alleles. * When considering several SNPs located at the same genes at one time: first used novoPhase (http://www.gene.cimr.cam.ac.uk/todd/) to do phasing. With the known phase, the counts can be summed across heterozygous SNPs, couting only once reads overlapping multiple SNPs. 2. Results: tested a total of 4929 pairs of heterozygous SNPs/samples, and 370 SNPs showed eveidence of allelic imbalance at P 0.001, with FDR ~ 1%. 3. HapMap SNPs: 87796 dbSNPs in spliced transcripts passed HapMap quality filters. When restricting the analysis to this subset of SNPs, the ASE rate was significant lower (4.6%), showing a more reliable estimation. 4. Coverage: when restricting this analysis to HapMap SNPs with read depth 100, a higher ASE rate of 7.5% (66 of 878) was achieved. III. Validation to the ASE analysis 1. Genotyping data: the authors genotyped the four individuals using the Illumina Quad660W BeadChip. They lowered the minimum read depth required to call SNPs in RNA-Seq data to 20, and identified 9727 pairs of SNP/samples shared between RNA-seq and genotyping chip. In these 9727 pairs, 6885 calls are homozygous based on both genotyping chip and RNA-Seq, and only 1 call is homozygous based on chip but heterozygous based on RNA-seq. In the remaining 2841 pairs which got heterozygous calls based on chip, 14 are homozygous based on RNA-seq data. Four of these calls were located in transcript SNRPN which is a known parentally imprinted gene (monoallelic expression). Three are located in ERAP2 which is with known complete cis-acting differential allelic control. Four of these are of too low read depth. The other three calls may be real. 2. Single locus validation: to confirm that some of our findings are not the consequence of techinical biases; selected four pairs of HapMap SNPs/individuals and validated them using two different locus-specific assays: clone-based allele-specific expression (C-BASE) and PeakPicker. All four initial RNA-seq results replicated and PeakPicker/C-BASE provided consistent results. For C-BASE, an allelic bias was observed even using genomic DNA (technical bias), thus using the gDNA allelic distribution as control when using chisq test. IV. ASE analysis of disease-associated genes 1. Genes: the authors reviewed the literature and identified 79 genes previously assiciated with autoimmune disorders. 2. SNPs: includes all heterozugous SNPs even not listed in the Ensembl database but discovered using the previous method. 61 heterozygous SNPs with read depth 50 were found and 8 of them were not listed in dbSNP. These genes was located in 22 genes. 3. Counting separately each pair of SNP/individual: tested 127 pairs and 13 were imbalanced. 4. The analysis of these 13 pairs: first grouped them with the genes they were located at; did phasing to test whether the SNPs were consistent with each other. V. Possible biases 1. Sequencing chemistry biases: for PCR biases over-amplifying identical cDNA fragment, because identical mapping location for both 45 bp reads is unlikely to occur randomly, for each set of clonal read pairs the authors only included a single read pair in the analysis; for biases that are specific to forward/reverse read direction, the authors verified that for heterozygous SNPs the allelic ratio ditribution was consistent for read counts obtained for the reverse and forward strands (so they will make no difference). 2. In silico mapping biases: the biases towards the reference allele. The authors solved these by: replaced the reference allele with the corresponding genetic ambiguity code coding both possible alleles at this SNP; relaxed the threshold on the number of mismatches for allowing reads to be mapped in order to limit the impact of a small number of mismatched SNPs. These two additions corrected most of the reference allele bias. 3. The authors excluded heterozygous SNPs with 45 bp of a called indel in the same individual from the ASE analysis. (novoalign provides calls for small indels) 4. Other biases: 1) Low-complexity or repeat sequence surrounding heterozygous SNPs is associated with an elevated ASE positive rate. The authors defined low-complexity/reoeat sequences as more than 25% of the 90 bp surrounding sequence (45 bp on each side) was masked by RepeatMasker. 2) the ASE positive rate was lower for heterozygous SNPs that passed HapMap quality filters (mentioned before). 5. Validation of additional ASE results: used individual 1 data and selected 22 transcripts with a unique imbalanced heterozygous SNP (resequencing, PeakPicker). The data suggest that several biases, some of which unknown, increase the FDR. The false positive rate among non-HapMap SNPs is ~50%. Thus unproven quality SNPs should only be used for ASE estimation with great caution and the presence of multiple imbalanced SNPs is required to provide convincing evidence of ASE. VI. Other tests 1. PE vs. SE reads: PE is better because of higher coverage. 2. Sequence capture: to improve the read depth.
Outline: I. Data 1) Data source: the authors sequenced the mRNA fraction of the transcriptome of lymphoblastoid cell lines from 60 CEU individuals using 37-bp paired-end Illumina sequencing. Each individual yielded 16.9+-5.9 million reads that mapped to the NCBI36 assembly of the human genome using MAQ. 86% of filtered reads mapped to known exons in Ensembl v54. 2) Data quantification: read counts for each individual were scaled to a theoretical yield of 10m reads and corrected for peak insert size across corresponding libraries. The authors developed a new method FluxCapacitor to map read into specific isoform. (Q: Can we use cufflink or scripture to do this?) 3) Quality evaluation: the authors campared whole-gene read counts to array intensities generated with Illumina HG-6 v2 microarrays. 4) Attempt to infer abundance values for exons that are not screened: with the same principle as using the correlation structure (LD) of genetic variants to impute variants from a reference to any population sample of interest. II. SNP-Expression Association 1) Association of gene expression measured by RNA-Seq with genetic variation: see reference , Population genomics of human gene expression. 2) Evaluation of association: through permutation (see reference ), in exons, transcripts, and genes. * example of permutation provided by Xie Gangcai: after doing mapping, disarrange the nucleotides of each read and do mapping again to see how many reads are mapped as background or control, and then use some test to get a p-value to see significance of the original mapping. 3) A problem of RNA-Seq: RNA-Seq exon eQTLs have lower representation in low abundance genes indicating that rare transcripts are not well quantified at this level of coverage. (consistent with the other paper). 4) Replicate the eQTL discoveries: compared associations between this study and those obtained from sequencing the transcriptomes of an African population. The authors assessed the P-value distribution of matching CEU associations given the top associated SNP for 500 genes from the African population. ~33% of these signals were shared (P0.0001 assessed by permutations). 5) Enrichment of eQTLs given an exon's location: eQTLs entiched around the TSS. The authors also identified increased number of discoveries for the first, second and last exon compared to any middle exons. When assessing the distribution of significant eQTLs around the 5' end of the exon of interest, the authors found that significant eQTLs when found associated with the last exon are closer to the last exon than any other exon. III. Quantification of allele-specific-expression (ASE) 1) Transcriptome sequencing allows the quantification of ASE. 2) SNP to assess ASE: An average of 4000 heterozygote confirmed HapMap3 SNP positions per individual. 3) The proportion where both SNP alleles were detected: the authors assessed it as a function of mapping quality using SAMtools. 72% of heterozygote sites have both alleles detectable at least once with MAQ mapping quality 10, and the number slightly decreases with increasing mapping quality. 41% of the heterozygotes have more than 6 reads. 4) ASE assession: first corrected for reference to non-reference differential mapping for each library because of a tendency for the reference allele to be overrepresented in pileups over a heterozygote. With this frequency as the success rate when assessing the binomial probability of allele-specific expression, the authors tested for ASE. 5) Relationship between known eQTLs and ASE: first phasing double heterozygotes for both eQTL and ASE. As coverage increasing, the correlation between eQTL significance and ASE ratio improves; and then reads were summed across individuals to assess the one-sided ASE binomial P-value distribution with respect to eQTL phasing. For 0.01 and 0.001 significant eQTLs, the tail of the ASE P-value distribution was enriched, while for the exons without eQTLs, both tails of this distribution were enriched. (Q1: How to do the phasing? Q2: What is the one-sided ASE p-value mean? What is the test testing for?) * Phasing: for heterozygotes, phasing is to find out the alleles located on each chromosome. Unphased data - Genotype; Phased data - Haplotypes. 6) Relationship between rare eQTLs: the authors selected SNPs heterozygous in six or more individuals in exons without evidence for an eQTL, and examined patterns of haplotype homozygosity between individuals that shared a significant ASE signal (at P0.05) with those that did not. * Haplotype homozygosity: the probability of selecting two identical haplotype at random from a population. IV. Genetic basic of alternative splicing 1) The authors performed association between known variants affecting splicing signals with their respective genes and exons; in total, 963 variants for 788 genes were tested. 2) Stratification: splice variants were stratified in donor and acceptor variants and tested against abundance of exons 5' and 3' to the intron where they are residing. They found that donor associates with 5' exon more than the 3' one, while acceptor associates with 3' exons more than 5' exons. 3) Further assumption: if genetic variants are effecting transcript-specific expression, the authors should be able to detect heterogeneity in the transcript distribution found between chromosomes within an individual. 4) Measurement of degree to which genetics influences transcript-specific expression: look at insert size distribution of paired end reads over each heterozygote. Their expectation is that the heterogeneity of inserts sizes over significant ASE heterozygotes between each of their alleles would be increased relative to that between alleles of non-significant ASE heterozygotes (if one haplotype is increasing the expression of a particular transcript relative to the other allele, the insert size distribution would be changed). The heterozygotes with a minimum of 50 reads for both allels were tested, which include 901 positions. For each heterozygote for an individual, a bootstrapped Kolmogorov-Smirnov test was run for the respective insert size distribution (Q: Why use bootstrapped KS test instead of ordinary KS test?), and then the p-values were separated given the heterozygote was significant for ASE or not. Of the 901 heterozygotes, 235 were significant for ASE and 105 had significant transcript distribution heterogeneity; this corresponded to 72 genes which contained an ASE significant heterozygote. 5) Effect of genetic variants on events contribute to alternative isoforms (e.g. inclusion/exclusion of exons): derived from the authors' method, FluxCapacitor. Of 6600 quantified events, 110 are significant at the 0.01 permutation threshold. V. Advanced Methods
来自波士顿大学和哈佛大学的研究者发展出一种结合杂交信号和manopore进行测序的单分子测序法 ,他们首先构建一种称为分子信标的寡聚核苷酸探针,每一种信标用两种不同的荧光标记表示,然后对待测序的DNA序列进行处理,使每一种碱基都能与特定的信标结合,然后引导这个已处理的DNA通过基于纳米技术构建的namopore,namopore顺次解脱待测DNA序列上的信标,而但信标别从DNA分子上解脱下来时,信标上的荧光基团就会淬发出荧光,通过高能超快速摄像机就把这些荧光信号依次记录下来,就可以读出待测DNA的序列。据研究者称,他们这种方法可以在一个小时以内完成500Gbp的测序量,而预计读长将达到1000bp以上。研究者称他们这种方法与其他基于纳米技术的方法最大不一样的地方就是,测序样品的制备和最后序列的读取是分开的,也就是说从获得DNA样品到处理这些DNA样品,加上信标的处理与最后进行DNA序列的读取是分开的,因此在序列的读取过程中不需要加入任何测序试剂(其他的测序技术一般都需要在序列的读取过程中反复的加入和洗脱反应试剂),与传统Sanger测序法测序样品的准备和上机通过电泳进行序列读取的过程是分开的相似。这就大大简化了测序设备的设计和生产,以及设备的后续升级。 这是在genomeweb上的评价 文章:Optical Recognition of Converted DNA Nucleotides for Single-Molecule DNA Sequencing Using Nanopore Arrays
我们知道,组成DNA序列的4种碱基有着不同的带电性质,如果能直接一一测量DNA链上每一碱基的带电性质,哪也就可以知道了DNA的序列了。一个很好的办法就是,用一块很薄的材料,然后在上面开一个很小的孔,测量通过小孔的电流(这个电流是恒定的),然后再测量当一个一段DNA序列通过这个小孔时的电流变化情况,通过电流的变化就可以测出DNA的序列了。这种设想最大的困难是,如何制造出厚度只有一个或小于一个碱基大小的材料。纳米技术所能制造的材料的厚度一般是单个碱基厚度的几十倍,很难满足用于这种方式的单分子测序技术。而来自荷兰Kavli纳米科学研究所的Grégory Schneider及其合作伙伴提出使用石墨单分子层来构建“薄板“的方法,用这种方法所构建的薄板的厚度只有一个碱基的厚度。 他们使用构建所构建的薄板及其在上面构建的小孔对DNA分子通过这个小孔时的电流变化进行了测量,测量的结果能准确反应是否有DNA链通过小孔,是单链还是双链。 目前他们构建的石墨单分子薄板材料只能给出DNA链是什么时候开始进入小孔,什么时候完全通过小孔的,但可以预计,在不久的将来,他们应该可以开发出测量DNA链上单个碱基通过小孔时的电量变化情况,从而实现无合成反应的DNA测序。 DNA Translocation through Graphene Nanopores 对这个方法的评论(不好意思,忘记来源了)