xbinbzy's personal blog http://blog.sciencenet.cn/u/xbinbzy


HMP Project - Metagenomics: Facts and Artifacts, and Computational Challenges

3758 views 2015-8-18 11:08 | Category: Research notes | Keywords: scholar | HMP Project

Paper: Metagenomics: Facts and Artifacts, and Computational Challenges

Journal: J Comput Sci Technol

Year: 2009


Studies based on next-generation sequencing (NGS) fall into three types: metagenomics, 16S rRNA sequencing, and targeted metagenomics.


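1) Assembly and gene prediction

   Most tools are designed to assemble a single genome and do not account for data that mixes several genomes.

   Common tools: Velvet (an Eulerian path assembler), ALLPATHS, Euler-SR.

   Gene prediction strategy: use 6-frame translation when conducting a similarity search on the short reads. Progress in this area is still limited; the available tools include MetaGene and Orphelia.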
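
To make the 6-frame translation step concrete, here is a minimal sketch in Python. It assumes the standard genetic code and plain A/C/G/T reads; the function names (`six_frame_translation` and its helpers) are illustrative and do not come from MetaGene, Orphelia, or any other tool named above.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(read):
    """Translate a read in three forward (+1..+3) and three reverse (-1..-3) frames."""
    read = read.upper()
    frames = {}
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frame_translation("ATGGCCTAA").items():
        print(name, protein)
```

Each of the six protein strings could then be passed to a protein similarity search, so that a coding region is found whichever strand and frame it lies in.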


2) Tools for characterizing microbial diversity qualitatively and quantitatively

   The taxonomic composition of the sample needs to be determined;

   Common tools: MEGAN, MLTreeMap, AMPHORA, CARMA

   How the tools work:

       MEGAN applies a simple lowest common ancestor algorithm to assign reads to taxa, based on BLAST similarity search results. In other words, taxa are determined by aligning reads against a database.

       MLTreeMap and AMPHORA are two phylogeny-based phylotyping tools that use the phylogenetic analysis of marker genes for taxonomic distribution estimation. Phylogenetic analysis of marker genes, including 16S rRNA genes, DNA polymerase genes, and 31 selected marker genes, has also been applied to determining taxonomic distribution. In other words, evolutionary relationships inferred from marker genes such as 16S rRNA and DNA polymerase genes are used to determine the taxonomic distribution.

       CARMA searches for conserved Pfam domains and protein families in the raw metagenomic sequences and classifies them into a higher-order taxonomy, based on the reconstruction of a phylogenetic tree of each matching Pfam family. In other words, taxa are assigned from conserved Pfam domains and protein families.
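
The lowest common ancestor idea described for MEGAN can be sketched as follows. This is a simplified illustration, not MEGAN's implementation: the taxonomy is a hypothetical child-to-parent dictionary, the taxa hit by a read are given directly, and any filtering of BLAST hits is left out.

```python
def lineage(taxon, parent):
    """Return the path from the root down to `taxon` (root first)."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lowest_common_ancestor(taxa, parent):
    """Deepest node shared by the lineages of all taxa hit by one read."""
    lineages = [lineage(t, parent) for t in taxa]
    lca = None
    for nodes in zip(*lineages):
        if len(set(nodes)) != 1:
            break
        lca = nodes[0]
    return lca


# Hypothetical toy taxonomy: child -> parent.
PARENT = {
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Salmonella enterica": "Salmonella",
    "Salmonella": "Enterobacteriaceae",
    "Enterobacteriaceae": "Bacteria",
    "Bacillus subtilis": "Bacillus",
    "Bacillus": "Bacteria",
}

if __name__ == "__main__":
    hits = ["Escherichia coli", "Salmonella enterica"]
    print(lowest_common_ancestor(hits, PARENT))  # Enterobacteriaceae
```

A read whose hits all fall in one species is assigned to that species, while a read with hits spread over distant taxa is pushed up to a higher rank, which keeps ambiguous reads from being over-classified.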

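   Issues raised:

       To study the taxa, sequences have to be assigned to bins, that is, the assembled sequences have to be clustered.

       Most existing computational binning tools simply utilize DNA composition. The basis of these approaches is that genome G+C content, dinucleotide frequencies, and synonymous codon usage vary among organisms, and are generally characteristic of evolutionary lineages. (A thought: as research deepens, the current composition-based methods will probably turn out to have many shortcomings. For example, should DNA conformation be taken into account? If so, current results are incomplete. The key is to identify which features represent a DNA sequence and to know which of them most strongly affect clustering.)

       Related tools: TETRA, MetaClust, and CompostBin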

           TETRA uses z-scores from tetramer frequencies to classify metagenomic sequences.

           MetaClust uses a combination of k-mer frequency metrics to score metagenomic sequences.

           CompostBin, a semi-supervised approach, uses a weighted PCA algorithm to project high dimensional DNA composition data into an informative lower-dimensional space, and then uses the normalized cut clustering algorithm to classify sequences into taxon-specific bins.
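
The composition-based idea behind these binning tools can be illustrated with a small sketch: compute a tetranucleotide frequency profile for each sequence and compare profiles by Pearson correlation. This is a deliberate simplification. It uses raw frequencies instead of the z-scores that TETRA computes, and the function names are illustrative only.

```python
from collections import Counter
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order, so that profiles are comparable.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetramer_profile(seq):
    """Relative frequency of each tetranucleotide; windows with other letters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS)
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[k] / total for k in KMERS]


def pearson(x, y):
    """Pearson correlation between two equally long profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


if __name__ == "__main__":
    at_rich = tetramer_profile("AT" * 20)
    at_rich_2 = tetramer_profile("AT" * 30)
    gc_rich = tetramer_profile("GC" * 20)
    print(pearson(at_rich, at_rich_2))  # close to 1: similar composition
    print(pearson(at_rich, gc_rich))    # negative: dissimilar composition
```

Sequences with highly correlated profiles would be candidates for the same bin; a real tool would also have to deal with short contigs, whose profiles are noisy.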

 

3) Function prediction

   There are still few algorithms and tools for this;

   Predictions are mostly assignments to COG families, KEGG families, and FIG families, and the annotation pipelines largely follow those of traditional genomics.

   Tools include MG-RAST, an automatic server for subsystem annotation for metagenomic datasets, based on an extension of the very successful microbial genome annotation server RAST.

   CD-HIT algorithm: rapid analysis of the sequence diversity for very large metagenomic datasets using a clustering approach.
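
The clustering approach can be sketched as a greedy incremental procedure in the style of CD-HIT: process sequences from longest to shortest, and either attach each one to the first existing representative it resembles or make it a new representative. The sketch below is an assumption-laden illustration: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in similarity measure, whereas CD-HIT itself uses its own short-word filter and alignment-based identity.

```python
from difflib import SequenceMatcher


def greedy_cluster(seqs, threshold=0.9):
    """Greedy incremental clustering; returns a list of (representative, members)."""
    clusters = []
    for seq in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            # Stand-in similarity score in [0, 1]; not CD-HIT's sequence identity.
            score = SequenceMatcher(None, rep, seq, autojunk=False).ratio()
            if score >= threshold:
                members.append(seq)
                break
        else:
            # No representative was similar enough: start a new cluster.
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    proteins = [
        "MKTAYIAKQRQISFVKSHFSRQ",
        "MKTAYIAKQRQISFVKSHFSR",
        "GGGGGGGGGGPPPPPPPPPP",
    ]
    for rep, members in greedy_cluster(proteins):
        print(rep, len(members))
```

Because each sequence is compared only with cluster representatives rather than with every other sequence, the number of comparisons stays manageable on large datasets.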


4) Comparative metagenomics

   Mostly based on comparative analysis of sequences

   Tools include UniFrac and MEGAN;

   UniFrac, a very popular tool for comparing communities based on their lineages, calculates the phylogenetic distance between two communities as the fraction of the branch length of the phylogenetic tree that leads to descendants from either one community or the other, but not both.

   MEGAN provides visual and statistical comparison of metagenomes based on the lineages they contain.

   Beyond sequence information, microbial communities can also be compared based on other types of information, such as the functions encoded by metagenomes.
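
The UniFrac distance described above can be sketched in a few lines for the unweighted case. The tree representation (a dictionary mapping each node to its parent and branch length) and the function name are assumptions made for this illustration and are not the interface of the UniFrac software.

```python
def unweighted_unifrac(tree, community_a, community_b):
    """Fraction of branch length leading to only one of the two communities.

    `tree` maps node -> (parent, branch_length); the root is not a key.
    Communities are collections of leaf names.
    """
    def covered_branches(leaves):
        # Branches (named by their child node) on the path from any leaf to the root.
        branches = set()
        for node in leaves:
            while node in tree and node not in branches:
                branches.add(node)
                node = tree[node][0]
        return branches

    a = covered_branches(community_a)
    b = covered_branches(community_b)
    total = sum(tree[n][1] for n in a | b)   # branches leading to either community
    unique = sum(tree[n][1] for n in a ^ b)  # branches leading to exactly one
    return unique / total if total else 0.0


# Hypothetical toy tree: ((t1:0.5, t2:0.5)X:1.0, t3:1.0 under Y:1.0)root
TREE = {
    "X": ("root", 1.0),
    "Y": ("root", 1.0),
    "t1": ("X", 0.5),
    "t2": ("X", 0.5),
    "t3": ("Y", 1.0),
}

if __name__ == "__main__":
    print(unweighted_unifrac(TREE, {"t1"}, {"t2"}))        # 0.5
    print(unweighted_unifrac(TREE, {"t1", "t2"}, {"t3"}))  # 1.0
```

Two communities that share all lineages get a distance of 0, and two communities with no shared branch get a distance of 1.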


5) Statistical tools for metagenomics

   Phylogeny-based statistical tools for comparing community structures include integral-LIBSHUFF, TreeClimber, and UniFrac.

   AMOVA (analysis of molecular variance) determines whether the genetic diversity within two or more communities is greater than their pooled genetic diversity.

   HOMOVA (homogeneity of molecular variance) determines whether the amount of genetic diversity in each community is significantly different.

   Metastats was developed for detecting significantly different features (such as taxa, biological pathways, or gene families) between two populations, aiming to study how two populations are different from each other.

   Each statistical tool suits different conditions; see the paper "Evaluating different approaches that test whether microbial communities have the same structure".
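
As an illustration of how a single feature (for example, the relative abundance of one taxon) can be tested for a difference between two populations, here is a generic permutation test on a Welch t statistic. It is a sketch of the general idea only and is not the procedure implemented in any of the tools above; the function names and the example abundances are made up for the illustration.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Welch t statistic for two samples (each needs at least two values)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    diff = mean(x) - mean(y)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def permutation_p_value(x, y, n_perm=1000, seed=0):
    """Two-sided p-value obtained by randomly reassigning samples to the two groups."""
    rng = random.Random(seed)
    observed = abs(welch_t(x, y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = welch_t(pooled[:len(x)], pooled[len(x):])
        if abs(t) >= observed - 1e-12:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Made-up relative abundances of one taxon in two groups of samples.
    group_1 = [0.30, 0.32, 0.31, 0.29, 0.33]
    group_2 = [0.10, 0.12, 0.11, 0.09, 0.13]
    print(permutation_p_value(group_1, group_2))
```

When many features are tested at once, the resulting p-values would also need a multiple-testing correction.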


6) Modeling interactions between microbes and their environment

   Little work has been done here so far; the "metabolic footprint" concept and network-based approaches have been proposed.


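7) Points to watch during analysis

   (1) 16S rRNA chimeras could lead to inaccurate estimation of the species diversity of a community

       Chimeras distort the results. They can be reduced in two ways: improving the experimental protocol or the emulsion PCR technique, and optimizing the data-processing side with tools such as Bellerophon, Pintail, and Mallard.

   (2) Artificial replicates may introduce systematic artifacts to the estimation of gene and taxon abundance

       Studies found that between 11% and 35% of sequences in a typical metagenome are artificial replicates.

   (3) Gene family frequencies derived from read counts in metagenomic data may be unreliable due to different gene family lengths

       The main point is to remove the effect of gene length, since longer genes attract more reads.

   (4) Be aware of artificial pathways

       MinPath explains all annotated functions with the smallest possible set of pathways.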
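
The parsimony idea in point (4) can be illustrated with a greedy set-cover sketch: repeatedly pick the pathway that explains the most still-unexplained functions. This is only an approximation written for illustration and is not how MinPath itself solves the problem; the pathway names and family identifiers below are hypothetical.

```python
def minimal_pathways(pathway_to_families, observed_families):
    """Greedy approximation of the smallest pathway set covering the observed families."""
    uncovered = set(observed_families)
    candidates = {
        name: set(families) & uncovered
        for name, families in pathway_to_families.items()
    }
    chosen = []
    while uncovered and candidates:
        # Sorting the names makes tie-breaking deterministic.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:
            break  # the remaining families belong to no known pathway
        chosen.append(best)
        uncovered -= gain
    return chosen


if __name__ == "__main__":
    # Hypothetical pathway definitions and annotated families.
    pathways = {
        "glycolysis": {"K1", "K2", "K3"},
        "pentose": {"K3", "K4"},
        "spurious": {"K2"},
    }
    observed = {"K1", "K2", "K3", "K4"}
    print(minimal_pathways(pathways, observed))  # ['glycolysis', 'pentose']
```

A naive mapping would report all three pathways as present; the parsimonious set drops the one whose only observed family is already explained by another pathway.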

 

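8) Remaining challenges

   (1) Scalability

       Data volumes are huge: NGS produces data ever faster and in ever larger amounts, which poses a major challenge for computation and analysis.

   (2) Integration of metaproteomic, metatranscriptomic, and metagenomic data sets

       Integrating and studying data across the genome, transcriptome, and proteome levels.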



https://m.sciencenet.cn/blog-306699-913777.html
