xbinbzy's personal blog http://blog.sciencenet.cn/u/xbinbzy


Chimera detection in 16S rRNA sequencing

Posted 2015-10-29 10:42 | Personal category: Research articles | System category: Research notes | Keywords: chimera, CATCh, 16S

   In 16S rRNA analysis, identifying and removing chimeras is an important step in data processing.

   Chimeras arise mainly from errors during PCR: "During this PCR amplification, chimeras might be created due to incomplete extension."
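   To make the idea of incomplete extension concrete, here is a minimal sketch. It assumes the commonly described route in which a strand that stopped extending early on one template anneals to a similar template in a later cycle and is completed along it. The function name, the toy sequences, and the breakpoint are all hypothetical and only illustrate the resulting sequence structure.

```python
def make_chimera(parent_a: str, parent_b: str, breakpoint: int) -> str:
    """Join the 5' part of parent_a to the 3' part of parent_b.

    Toy model: extension on parent_a stopped at `breakpoint`, and the
    truncated product was later completed using parent_b as template.
    """
    if not 0 < breakpoint < min(len(parent_a), len(parent_b)):
        raise ValueError("breakpoint must fall inside both parents")
    return parent_a[:breakpoint] + parent_b[breakpoint:]


# Hypothetical, made-up sequences of equal length (not real 16S data).
parent_a = "ACGTACGTACGTACGTACGT"
parent_b = "ACGTTTGTACGAACGTACCA"

chimera = make_chimera(parent_a, parent_b, 10)
print(chimera)
```

   The product matches parent_a up to the breakpoint and parent_b after it, which is the signature that the detection tools discussed in this post look for.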

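   During amplification, chimeras can make up more than 70% of the unique amplicon sequences: "Likewise, the percentage of chimeric sequences in the unique amplicon pool of PCR-amplified samples might reach values higher than 70%." (When optimizing the experimental protocol, consider ways to reduce chimera formation.)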

   Chimera handling strategies fall into two main groups: reference-based and de novo.

   The principle of the reference-based approach is: "Reference-based methods basically screen the sequences potentially containing chimeras against a curated reference database with chimera-free sequences." Tools of this kind include Pintail and Bellerophon.

   ChimeraSlayer builds on these with substantial changes and better performance. Its basic principle is that it "uses 30% of each end as a seed for searching a reference data set, finding the closest parent (if any), performing alignments, and scoring to the candidate parents." Its weakness is that "it was not able to detect chimeras with a small chimeric range."

   Reference-based UCHIME performs better than ChimeraSlayer: "In reference-based UCHIME, query sequences are divided into four nonoverlapping segments and searched against a reference database."

   Studies have reported that ChimeraSlayer and reference-based UCHIME do worse than DECIPHER when long reads contain a short chimeric range. They "were reported to have a lower accuracy than that of DECIPHER in cases where the algorithms were challenged with a data set containing chimeric sequences with a short chimeric range and long sequence lengths." DECIPHER works as follows: "The DECIPHER algorithm is a search-based algorithm that splits the query sequence into different fragments and analyzes whether those fragments are uncommon in the reference phylogenetic group where the query sequence is classified. If a significant amount of fragments is assigned to a phylogenetic group different from the complete query sequence, the sequence is classified as chimeric."
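   The segment-based search described for reference-based UCHIME can be sketched roughly as follows. This is a toy illustration, not UCHIME itself: it assumes the query and all reference sequences are already aligned and of equal length, uses ungapped per-position identity instead of a real database search, and simply flags a query whose segments prefer different references. All function names and sequences are hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length strings."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def split_segments(seq: str, n: int = 4):
    """Return n nonoverlapping, contiguous (start, end) ranges covering seq."""
    bounds = [round(i * len(seq) / n) for i in range(n + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(n)]


def best_hits(query: str, reference: dict, n_segments: int = 4):
    """For each query segment, name the reference with the highest identity."""
    hits = []
    for start, end in split_segments(query, n_segments):
        segment = query[start:end]
        best = max(reference,
                   key=lambda name: identity(segment, reference[name][start:end]))
        hits.append(best)
    return hits


def looks_chimeric(query: str, reference: dict, n_segments: int = 4) -> bool:
    """Flag the query if its segments do not all prefer the same reference."""
    return len(set(best_hits(query, reference, n_segments))) > 1


# Hypothetical chimera-free reference set (made-up sequences).
reference = {
    "ref_a": "ACGTACGTACGTACGT",
    "ref_b": "TTGCATGCTTGCATGC",
}
query = reference["ref_a"][:8] + reference["ref_b"][8:]

print(best_hits(query, reference))
print(looks_chimeric(query, reference))
```

   Real tools go further: they align the query to the candidate parents and compute a score, which this sketch leaves out.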

   In practice, it is hard to evaluate chimera detection tools in a fair and uniform way; each tool has its own range of applicability.

   The principle of the de novo strategy is: "De novo methodologies are generally based on the fact that parents of any chimeric sequence have gone through at least one more PCR cycle than chimeric sequences." Tools include Perseus, de novo UCHIME, and de novo ChimeraSlayer, all of which are now integrated into mothur. More recently, "the UPARSE pipeline was released, combining in one step chimera detection with clustering of sequencing reads into operational taxonomic units."
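   Because parents have gone through at least one more PCR cycle, de novo methods expect them to be more abundant than their chimeras. The following is a minimal sketch of that idea, not the algorithm of any of the tools named above. It assumes equal-length sequences, requires an exact two-parent match, and uses a hypothetical abundance ratio threshold (`min_ratio`) chosen only for illustration.

```python
from itertools import combinations
from typing import Dict, Set


def is_two_parent_chimera(query: str, parent_a: str, parent_b: str) -> bool:
    """True if query is a prefix of one parent joined to a suffix of the other."""
    if query in (parent_a, parent_b):
        return False
    if not (len(query) == len(parent_a) == len(parent_b)):
        return False
    for left, right in ((parent_a, parent_b), (parent_b, parent_a)):
        for bp in range(1, len(query)):
            if query == left[:bp] + right[bp:]:
                return True
    return False


def flag_de_novo(abundance: Dict[str, int], min_ratio: float = 2.0) -> Set[str]:
    """Flag sequences explained by two sufficiently more abundant parents."""
    chimeras: Set[str] = set()
    # Visit sequences from most to least abundant, so possible parents
    # are always examined before the sequences they might explain.
    ranked = sorted(abundance, key=abundance.get, reverse=True)
    for i, query in enumerate(ranked):
        candidates = [
            seq for seq in ranked[:i]
            if seq not in chimeras
            and abundance[seq] >= min_ratio * abundance[query]
        ]
        if any(is_two_parent_chimera(query, a, b)
               for a, b in combinations(candidates, 2)):
            chimeras.add(query)
    return chimeras


# Hypothetical unique sequences with made-up read counts.
counts = {
    "ACGTACGTACGTACGT": 100,
    "TTGCATGCTTGCATGC": 80,
    "ACGTACGTTTGCATGC": 5,   # prefix of the first + suffix of the second
}
print(flag_de_novo(counts))
```

   If the read counts are distorted so that the chimera appears as abundant as its parents, no candidate parent passes the ratio test and the chimera goes undetected.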

   The reference-based and de novo strategies each have strengths and weaknesses:
   1) The strength of reference-based methods: "In situations dealing with well-studied environments, the reference-based approaches were found to be very effective in distinguishing between chimeras and chimera-free (parent) sequences."
   2) The weakness of reference-based methods: "efficiency is assumed to be lower when dealing with less well-known environments." This is exactly where de novo methods are strong.
   3) The weakness of de novo methods: "most of the de novo approaches depend on redundancy differences between chimeras and parents, assuming that the number of parent sequences has to be at least one time more redundant than their corresponding chimeric sequences. This requires data abundances to have been reported with high accuracy." (This raises the question of how much data is needed to guarantee good results.)


   In 2015, CATCh (Combining Algorithms to Track Chimeras) was published. It takes the outputs of other chimera detection tools as input, builds a classification model by supervised learning, checks the model's accuracy on a test set, and then uses the final model to identify chimeras. The classifier is "able to discriminate between chimeric and nonchimeric sequences based on a specific set of input data (called features in the context of machine learning)." The input to this tool is not the sequencing reads themselves but the chimera detection results of the different tools: "With this tool, we use as input data not the sequence read characteristics but rather the results (e.g., scores) of different individual chimera detection tools mentioned above and integrate them into one prediction. All different tools are run separately, and their output values are combined and processed by the classifier in order to give a prediction of whether a read is a chimera or not."

   The tool works in three main steps:
   (1) "the necessary input features (i.e., output values of the different chimera detection tools) are identified."
   (2) "the classifier is trained via a supervised learning approach. In this step, the classifier learns to make a correct prediction based on example input data; in our case, training data consist of the output features of a set of sequences reads obtained from different chimera detection tools, together with their correct classification (i.e., whether this read is a chimeric sequence or not)."
   (3) "In the third step, the trained classifier can be used to predict chimeric sequences in new, previously unseen data (i.e., data that did not belong to the training data)."

   "By feeding the outputs of the different individual chimera detection tools into the classifier, CATCh is able to classify them into chimeric and chimera-free subsets. As two different types of chimera detection tools exist, either reference based or de novo, we also developed two different versions of CATCh. In order to illustrate its performance, CATCh (reference based as well as de novo) was benchmarked against other chimera detection tools using various publicly available benchmark data."

   (Since the detection results of other tools serve as the input data, does disagreement between those tools affect the model's results?)
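   The three steps (collect tool outputs as features, train on labeled reads, predict on unseen reads) can be sketched as below. This is only an illustration under loud assumptions: the quoted text does not say which classifier is used, so a tiny logistic regression written from scratch stands in for it, and the feature values are synthetic scores from three unnamed, hypothetical tools scaled to [0, 1]. Nothing here reproduces CATCh itself.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(features, labels, epochs: int = 2000, lr: float = 0.5):
    """Fit logistic regression weights by batch gradient descent."""
    n = len(features[0])
    weights, bias = [0.0] * n, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            for j in range(n):
                grad_w[j] += err * x[j]
            grad_b += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def chimera_probability(model, x) -> float:
    weights, bias = model
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


def predict(model, x) -> bool:
    """True means the read is predicted to be chimeric."""
    return chimera_probability(model, x) >= 0.5


# Step 1: features = one score per (hypothetical) tool for each read.
# Step 2: labels give the known class (1 = chimeric, 0 = chimera-free).
# All numbers are synthetic and exist only to make the sketch runnable.
train_x = [
    [0.90, 0.80, 0.70], [0.80, 0.90, 0.20], [0.70, 0.60, 0.90], [0.95, 0.30, 0.80],
    [0.10, 0.20, 0.10], [0.20, 0.10, 0.30], [0.30, 0.20, 0.10], [0.10, 0.40, 0.20],
]
train_y = [1, 1, 1, 1, 0, 0, 0, 0]
model = train(train_x, train_y)

# Step 3: predict on a read that was not part of the training data.
print(predict(model, [0.90, 0.85, 0.90]))
# A read on which the tools disagree still receives a single combined score.
print(chimera_probability(model, [0.90, 0.20, 0.80]))
```

   In this toy setting, disagreement between tools does not break the model: it only moves the combined score according to the weights learned for each tool.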



Reference:

Mysara M, Saeys Y, Leys N, et al. CATCh, an ensemble classifier for chimera detection in 16S rRNA sequencing studies. Appl Environ Microbiol. 2015, 81(5):1573-84. doi: 10.1128/AEM.02896-14




https://m.sciencenet.cn/blog-306699-931853.html
