ScienceNet.cn (科学网)


Tag: exome sequencing (外显子测序)


Related blog posts

In a whole-exome sequencing analysis pipeline, when should the exome interval file be added?
chenjianhai 2020-9-29 23:09
When should I restrict my analysis to specific intervals?

This document covers the reasoning behind the use of genomic intervals. If you're looking for instructions on how to use intervals in practice, including argument details and supported formats, please see this doc.

Depending on what you're trying to do, there are many reasons why you might want to tell a tool to operate on a subset of genomic regions only. We distinguish four main types of reasons for doing so:

1. You want to run a quick test on a subset of data (often used in troubleshooting)
2. You want to parallelize execution of an analysis across genomic regions
3. You need to exclude regions that have bad or uninformative data where a tool is getting stuck
4. The analysis you're running should only take data from those subsets due to how the underlying algorithm works

The first three should be fairly self-explanatory, but let's go into a bit more detail on the fourth one.

In a nutshell

Whole genome analysis: Intervals are not required but they can help speed up analysis by eliminating difficult regions and enabling parallelism.

Exome analysis and other targeted sequencing: You must provide the list of targets, with padding, to exclude off-target noise. This will also speed up analysis and enable parallelism.

Whole genome analysis

It is not strictly necessary to restrict analysis to intervals when working with whole genomes, since presumably you're interested in all of it. However, from a technical perspective, you may want to mask out certain contigs (e.g. chrY or non-chromosome contigs) or regions (e.g. centromere) where you know the data is not reliable or is very messy, causing excessive slowdowns. In addition, defining whole-genome intervals allows you to parallelize execution across intervals using the scatter-gather mode of parallelism. We share the lists of good whole-genome intervals that we use in our production pipelines for human analysis in our resource bundle (see Download page).

Exome analysis and other targeted sequencing

By definition, exome sequencing and other targeted sequencing data don't cover the entire genome, so most analyses can be restricted to just the capture targets (genes or exons) to save processing time and enable scatter-gather parallelism. In addition, there are some processing steps, such as BQSR, that should be restricted to the capture targets in order to eliminate off-target sequencing data, which is uninformative and is a source of noise.

You should use the list of target intervals that corresponds to the library preparation method that was used to generate the data. If you're working with exome sequencing data that was prepared by someone else, you'll need to find out what kit was used; the kit manufacturers typically provide the lists of intervals that correspond to their kits on their website. We cannot provide you with a suitable interval list unless you are sure that your data was sequenced at the Broad.

Important notes:

Whatever you end up using intervals for, keep this in mind: for tools that output a BAM or VCF file, the output file will only contain data from the intervals you specified. Any data that falls outside these intervals will be lost to downstream analysis.

In general we recommend adding some padding to the intervals in order to include the flanking regions (typically about 100 bp). No need to modify your target list; you can have the GATK engine do it for you automatically using the interval padding argument. This is not required, but if you do use it, you should do it consistently at all steps where you use a list of intervals.

You will have noticed by now that we do not provide detailed guidelines for which tool should or should not use an interval list in this article. For tool-by-tool recommendations, please see the example commands in the individual tool docs; they show the most common recommended usage for each. See also the Best Practices documentation for up to date implementation notes.

Whole-exome sequencing (WES) still plays an important role in identifying pathogenic variants in many disease pedigrees. Unfortunately, the exome analysis pipelines that can be found on Chinese-language websites tend to be vague and often contain incorrect steps. This post clarifies a few key points.

Can whole-exome and whole-genome sequencing use the same pipeline?

No. According to https://gatk.broadinstitute.org/hc/en-us/articles/360035889551?id=4133, whole-exome data contain off-target reads, so the analysis has to be restricted to intervals. WGS can also use intervals, both to call variants in selected regions only and to exclude SNPs in regions where the reference is of poor quality.

How, and at which step, should the exome interval information be added?

Answer: the English text at the top of this post is the GATK documentation. It states that the -L argument should be supplied starting from the BQSR step, so that recalibration is restricted to the target intervals and off-target sequencing data are excluded.

Agilent exome capture kits come with several BED files. Which one should be used?

Answer: many people are unsure which of the Agilent files to use as the interval file. The same question comes up often on English-language sites, and the answers there are not very definitive either; https://www.biostars.org/p/422896/ shows how common the problem is. An Agilent whole-exome kit ships with four BED files, one of which is the padded file. The GATK documentation above recommends padding the target intervals, so the padded file is the one to use.

Chinese-language WES pipelines either pass off a WGS pipeline as an exome pipeline or do not know where to add the argument that excludes off-target data. Some hold that it should be added at the HaplotypeCaller step, which is clearly not what the GATK documentation recommends. Several workflows recommended on Chinese websites have such problems. For example, the one on Zhihu at https://zhuanlan.zhihu.com/p/137078769 does not use the correct padded file. So be careful when following pipelines found online, and do not trust them blindly.
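The padding step described in the GATK notes above can be illustrated with a small sketch. This is not GATK code: the function name `pad_and_merge`, the toy coordinates, and the choice to merge overlapping intervals after padding are all assumptions made for illustration. Intervals are treated as BED-style (contig, start, end) tuples with 0-based, half-open coordinates, and the default padding of 100 bp follows the "typically about 100 bp" figure quoted above.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: (contig, start, end), BED-style 0-based half-open.
Interval = Tuple[str, int, int]


def pad_and_merge(intervals: Iterable[Interval], padding: int = 100) -> List[Interval]:
    """Extend each interval by `padding` bp on both sides, then merge overlaps.

    Starts are clipped at 0. Ends are NOT clipped to the contig length,
    because this sketch has no access to a sequence dictionary.
    """
    padded = sorted(
        (contig, max(0, start - padding), end + padding)
        for contig, start, end in intervals
    )
    merged: List[Interval] = []
    for contig, start, end in padded:
        if merged and merged[-1][0] == contig and start <= merged[-1][2]:
            # Overlaps or touches the previous interval on the same contig.
            prev_contig, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_contig, prev_start, max(prev_end, end))
        else:
            merged.append((contig, start, end))
    return merged


# Toy capture targets (made-up coordinates, not from any real kit).
targets = [("chr1", 1000, 1200), ("chr1", 1350, 1500), ("chr2", 50, 80)]
padded_targets = pad_and_merge(targets, padding=100)
for contig, start, end in padded_targets:
    print(f"{contig}\t{start}\t{end}")
```

Whichever way the padding is applied (a pre-padded BED file from the vendor or the engine's interval padding argument), the point made in the notes above still holds: use the same padded target set at every step that takes an interval list.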
3079 views | 0 comments
Bias in exome sequencing
qianggong 2011-9-22 15:21
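Today I had planned to run a simulation to see where the bias in exome capture sequencing results comes from. On the experimental side, exome capture based on molecular hybridization, and on the analysis side, read alignment, both favor sequences that match the reference, whereas the goal is to discover variant sequences in the genome. Such bias inevitably affects the mutation frequencies that are finally obtained. For high-frequency mutations the effect is tolerable, but for low-frequency mutations it is large. Correcting these biases effectively is crucial for population genetics research.

Halfway through the simulation, however, I decided to give up. Whichever of exome capture and read alignment turns out to contribute more, the influence of these factors is not uniform across genomic regions, so no single correction factor exists.

One potentially effective approach is to compare different exome sequencing datasets to see whether capture efficiency correlates with position on the reference, and from that build a database of capture/alignment efficiencies that can be used to correct the bias in observed data.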
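To make the argument concrete, here is a minimal sketch of how a reference-favoring bias could distort an observed allele fraction, and how a region-specific efficiency value could be used to undo it. The multiplicative model, the function names, and the efficiency value 0.8 are assumptions for illustration only; the post does not specify a model. The assumption is that reads carrying the variant allele are recovered (captured and aligned) with relative efficiency e compared with reference reads, so a true fraction p is observed as e*p / (e*p + (1 - p)).

```python
def observed_alt_fraction(true_fraction: float, relative_efficiency: float) -> float:
    """Observed variant-allele fraction under the assumed multiplicative bias model."""
    if relative_efficiency <= 0:
        raise ValueError("relative_efficiency must be positive")
    alt = true_fraction * relative_efficiency  # variant reads that survive
    ref = 1.0 - true_fraction                  # reference reads (efficiency 1)
    return alt / (alt + ref)


def corrected_alt_fraction(observed_fraction: float, relative_efficiency: float) -> float:
    """Invert the model above: recover the true fraction from the observed one."""
    if relative_efficiency <= 0:
        raise ValueError("relative_efficiency must be positive")
    return observed_fraction / (
        observed_fraction + relative_efficiency * (1.0 - observed_fraction)
    )


# Hypothetical efficiency for one genomic region; a real value would have to
# come from the kind of capture/alignment efficiency database proposed above.
efficiency = 0.8
for p in (0.5, 0.01):
    o = observed_alt_fraction(p, efficiency)
    relative_loss = (p - o) / p
    print(f"true={p:.4f} observed={o:.4f} relative loss={relative_loss:.1%}")
```

Under this toy model the relative underestimate is larger for the rare variant than for the common one, and the correction only works if the efficiency is known for the specific region, which is the reason the post gives for needing a per-region database rather than one global factor.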
Category: Research notes | 7120 views | 0 comments
Novelty search analysis: exome sequencing can also accurately identify disease-causing genes
xupeiyang 2009-8-18 19:46
http://www.sciencenet.cn/htmlnews/2009/8/222546.shtm

The US National Heart, Lung, and Blood Institute (NHLBI) announced in a news release on August 16 that, in collaboration with other research institutions, it had successfully sequenced the exomes of 12 individuals, demonstrating the feasibility and practical value of exome sequencing for identifying rare disease-causing variants. The results were published online in Nature on August 16.

Exons are the parts of human genes that carry the information needed to synthesize proteins. The complete set of exons, called the exome, makes up only about one percent of the human genome. Exome sequencing only needs to target the DNA in exonic regions, so it is far simpler and cheaper than whole-genome sequencing, and it has become a focus of current sequencing work.

According to the news release, to test the practicality of exome sequencing, a research team funded by the US National Institutes of Health selected 12 individuals for exome sequencing. Eight of them (4 Yoruba from Africa, 2 East Asians, and 2 European Americans) had DNA that was already characterized by the International HapMap Project. The other 4 were unrelated individuals with Freeman-Sheldon syndrome, a rare inherited disorder caused by mutations in the MYH3 gene. These 4 were included to confirm whether exome sequencing could detect the MYH3 mutations in their DNA.

The researchers first fragmented the 12 genomic DNA samples and then used special probes to select only the fragments containing exons. Sequencing and analysis of the 12 exomes determined a total of 300 million DNA bases, the largest data set of human protein-coding sequence obtained with second-generation sequencing technology to date.

Compared with conventional human genome sequencing, exome sequencing showed high sensitivity for detecting genetic variants, both common and rare. With it, the researchers were able to identify a range of DNA "misspellings", such as single nucleotide polymorphisms (SNPs) and insertions and deletions in gene sequences.

By applying a multi-step filtering approach that removed common variants and variants unique to a single individual, the researchers accurately pinpointed the causal gene variants in the DNA of the 4 Freeman-Sheldon syndrome patients. Their study shows that, for diseases caused by variants in a single gene, exome sequencing can identify the causal gene as accurately as whole-genome sequencing. The researchers believe exome sequencing can also be used to study common diseases caused by variants in multiple genes, such as diabetes and cancer, to uncover the genes responsible.

Dr. Elizabeth G. Nabel, director of the NHLBI, noted that exome sequencing can provide information about the genetic basis of disease, and expressed the hope that this kind of targeted sequencing can one day be applied to large populations to help uncover the genetic basis of common conditions such as high blood pressure and high cholesterol.

The study was funded by the US National Institutes of Health, with participation from scientists at the University of Washington, Agilent Technologies (which received funding from the NHLBI), the National Human Genome Research Institute, and the Eunice Kennedy Shriver National Institute of Child Health and Human Development. The study is also part of the Exome Project, a collaboration between the NHLBI and the National Human Genome Research Institute that aims to develop, validate, and apply a low-cost, high-efficiency exome sequencing method.

Related literature:

Title: Genetic variation in an individual human exome.
PMID: 18704161
Authors: Ng, P C, Levy, S, Huang, J, Stockwell, T B, Walenz, B P, Li, K, Axelrod, N, Busam, D A, Strausberg, R L, Venter, J C
Journal: PLoS Genet, Vol. 4 (8): e1000160, 2008

Abstract: There is much interest in characterizing the variation in a human individual, because this may elucidate what contributes significantly to a person's phenotype, thereby enabling personalized genomics. We focus here on the variants in a person's 'exome,' which is the set of exons in a genome, because the exome is believed to harbor much of the functional variation. We provide an analysis of the approximately 12,500 variants that affect the protein coding portion of an individual's genome. We identified approximately 10,400 nonsynonymous single nucleotide polymorphisms (nsSNPs) in this individual, of which approximately 15-20% are rare in the human population. We predict approximately 1,500 nsSNPs affect protein function and these tend to be heterozygous, rare, or novel. Of the approximately 700 coding indels, approximately half tend to have lengths that are a multiple of three, which causes insertions/deletions of amino acids in the corresponding protein, rather than introducing frameshifts. Coding indels also occur frequently at the termini of genes, so even if an indel causes a frameshift, an alternative start or stop site in the gene can still be used to make a functional protein. In summary, we reduced the set of approximately 12,500 nonsilent coding variants by approximately 8-fold to a set of variants that are most likely to have major effects on their proteins' functions. This is our first glimpse of an individual's exome and a snapshot of the current state of personalized genomics. The majority of coding variants in this individual are common and appear to be functionally neutral. Our results also indicate that some variants can be used to improve the current NCBI human reference genome. As more genomes are sequenced, many rare variants and non-SNP variants will be discovered. We present an approach to analyze the coding variation in humans by proposing multiple bioinformatic methods to hone in on possible functional variation.

Affiliation: J Craig Venter Institute, Rockville, Maryland, United States of America. png@jcvi.org

Pubmed MeSH: Gene Frequency, Genetic Diseases, Inborn, Humans, Mutation

Wikipedia: Amino Acids, Bio-informatics, Bioinformatics, Cistron, Computational Biology, Exon, Gene, Genetic diversity, Genetic material, Genetic variation, Genome, Genomics, Human Genome, Nucleotides, Phenotype, Proteins, SNPs, Single Nucleotide Polymorphism, Single nucleotide polymorphisms, Variation (genetics)

Analysis platform: http://www.gopubmed.org/web/gopubmed/

Search strategy: exome and SNPs =exomer complex exome and SNPs exome and MYH3

Related publications: 499

Bibliometric results for the related publications (name followed by number of publications):

Top Years: 2008 (198); 2007 (100); 2006 (44); 2005 (35); 2004 (23); 2002 (16); 2009 (16); 2003 (16); 2001 (11); 2000 (6); 1999 (6); 1998 (3); 1984 (3); 1997 (3); 1996 (2); 1994 (1); 1993 (1); 1978 (1); 1982 (1); 1991 (1)

Top Countries: USA (217); United Kingdom (39); Germany (29); Canada (18); China (16); Japan (16); Australia (12); France (11); Netherlands (7); Denmark (7); Italy (5); Belgium (5); Sweden (4); South Korea (4); Spain (4); Switzerland (3); Norway (3); Ireland (3); Singapore (3); Iceland (3)

Top Cities: Seattle (18); Boston (16); Cambridge, USA (16); Cambridge (11); New York (11); Bethesda (11); San Francisco (9); Toronto (8); Tokyo (8); Houston (7); Heidelberg (7); Oxford (6); London (6); Stanford (6); Rockville (5); Atlanta (5); St. Louis (5); Munich (5); Berlin (5); Beijing (4)

Top Journals: Nat Genet (33); Genome Res (22); Nature (22); Hum Mutat (17); Science (16); Bmc Genomics (14); Plos Genet (13); Nucleic Acids Res (12); Hum Mol Genet (9); Nat Rev Genet (8); P Natl Acad Sci Usa (8); Genome Biol (7); Am J Hum Genet (7); Pharmacogenomics (7); Proc Natl Acad Sci U S A (6); Pharmacogenet Genomics (6); Annu Rev Genom Hum G (5); Bioinformatics (5); Bmc Bioinformatics (5); Trends Genet (5)

Top Authors: Bork P (9); Daly M (8); Lander E (8); Scherer S (7); Dermitzakis E (7); Nickerson D (7); Gibbs R (7); Mullikin J (6); Hussler D (6); Clamp M (6); Collins F (6); Muzny D (6); Feuk L (5); Abril J (5); Mardis E (5); Sunyaev S (5); Fulton L (5); Worley K (5); Birney E (5); Weinstock G (5)

Top Terms: Humans (445); Genes (287); Genomics (276); Variation (Genetics) (265); Genome (262); Polymorphism, Single Nucleotide (256); Genome, Human (235); Mutation (178); Proteins (158); Phenotype (149); Nucleotides (143); Animals (124); Alleles (119); DNA (111); Genotype (96); Genetic Diseases, Inborn (86); Gene Frequency (86); Polymorphism, Genetic (82); Base Sequence (80); Amino Acids (76)
Category: Sci-tech novelty search | 3227 views | 0 comments
