ScienceNet (科学网)

Tag: RDP

Related blog posts

Principles of taxonomic classification in the Ribosomal Database Project (RDP) Classifier
xbinbzy 2015-12-8 16:41
Paper: Naïve Bayesian Classifier for Rapid Assignment of rRNA Sequences into the New Bacterial Taxonomy

The RDP Classifier assigns taxonomy with a naïve Bayesian model. The key quantities it has to compute are the prior probabilities and the conditional probabilities.

1) Eight-base subsequences are used as the features from which the probabilities are computed. The paper gives the rationale:

"The Classifier uses a feature space consisting of all possible 8-base subsequences (words). Word sizes between 6 and 9 bases were tested in preliminary experiments. Sizes of 8 and 9 bases gave nearly identical results, while sizes of 6 and 7 bases were less accurate, especially with shorter test sequences (not shown). As there are fewer possible words of size 8 than size 9, size 8 was chosen for all further work to reduce memory requirements. The position of a word in a sequence is ignored. As with text-based Bayesian classifiers, only those words occurring in the query contribute to the score. A similar word-based classification scheme has been used to search for horizontal gene transfer events in whole-genome sequences."

2) The probability of observing each subsequence. The paper computes it as follows:

"Let $W = \{w_1, w_2, \ldots, w_d\}$ be the set of all possible eight-character subsequences (words). From the corpus consisting of $N$ sequences, let $n(w_i)$ be the number of sequences containing subsequence $w_i$. The expected-likelihood estimate (determined with the Jeffreys-Perks law of succession) calculated for each word over the entire corpus with the formula $P_i = [n(w_i) + 0.5]/(N + 1)$ was used as a word-specific prior estimate of the likelihood of observing word $w_i$ in an rRNA sequence. The values 0.5 in the numerator and 1 in the denominator keep the probabilities in the range $0 < P_i < 1$."

3) The conditional probabilities. The paper's approach:

"For genus $G$ with a training set consisting of $M$ sequences, let $m(w_i)$ be the number of these sequences containing word $w_i$. The conditional probability that a member of $G$ contains $w_i$ was estimated with the equation $P(w_i \mid G) = [m(w_i) + P_i]/(M + 1)$. Ignoring the dependency between words in an individual sequence, the joint probability of observing from genus $G$ a (partial) sequence, $S$, containing a set of words, $V = \{v_1, v_2, \ldots, v_f\}$ ($V \subseteq W$), was estimated as $P(S \mid G) = \prod_i P(v_i \mid G)$."

4) Classification from the values given by Bayes' theorem. The paper's approach:

"By Bayes' theorem, the probability that an unknown query sequence, $S$, is a member of genus $G$ is $P(G \mid S) = P(S \mid G) \times P(G)/P(S)$, where $P(G)$ is the prior probability of a sequence being a member of $G$ and $P(S)$ the overall probability of observing sequence $S$ (from any genus). Assuming all genera are equally probable (equal priors), the constant terms $P(G)$ and $P(S)$ can be ignored. We classify the sequence as a member of the genus giving the highest probability score, but we ignore the actual numerical probability estimate."

5) Because the computation differs from one query sequence to the next, a bootstrap procedure is used to gauge how reliable each assignment is. The paper describes it as follows:

"For each query sequence, the collection of all eight-character subsequences (words) in the query was first calculated. Normally, when data consist of independent features, a bootstrap sample size equal to the number of features in the original sample is chosen. In this case, the number of completely independent features equals the number of nonoverlapping words. So for each bootstrap trial, a subset of one-eighth of the words was randomly chosen (with replacement) and the words in this subset were then used to calculate the joint probability. The number of times a genus was selected out of 100 bootstrap trials was used as an estimate of confidence in the assignment to that genus. For higher-rank assignments, we sum the results for all genera under each taxon."
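The five steps above can be put together in a short Python sketch. It is an illustration under stated assumptions, not the RDP Classifier's actual code: the function names (`words`, `train`, `word_prior`, `log_joint`, `best_genus`, `classify`), the dictionary layout of the model, and the toy sequences are all made up here. It also sums logarithms instead of multiplying probabilities, which picks the same best genus while avoiding underflow on long sequences.

```python
import math
import random
from collections import Counter, defaultdict

WORD_SIZE = 8  # word length reported in the paper


def words(seq, k=WORD_SIZE):
    """All overlapping k-base words of seq, in order (positions are not kept)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train(examples):
    """Build word counts from an iterable of (genus, sequence) pairs."""
    n = Counter()             # n[w]: training sequences containing word w
    m = defaultdict(Counter)  # m[G][w]: sequences of genus G containing w
    M = Counter()             # M[G]: number of training sequences of genus G
    for genus, seq in examples:
        present = set(words(seq))  # count each word once per sequence
        n.update(present)
        m[genus].update(present)
        M[genus] += 1
    return {"n": n, "N": sum(M.values()), "m": m, "M": M}


def word_prior(model, w):
    """P_i = (n(w_i) + 0.5) / (N + 1)."""
    return (model["n"][w] + 0.5) / (model["N"] + 1)


def log_joint(model, genus, ws):
    """log P(S|G) as the sum of log P(v_i|G), P(v_i|G) = (m(v_i) + P_i) / (M + 1)."""
    M = model["M"][genus]
    counts = model["m"][genus]
    return sum(math.log((counts[w] + word_prior(model, w)) / (M + 1)) for w in ws)


def best_genus(model, ws):
    """Genus with the highest score; with equal priors P(G) and P(S) drop out."""
    return max(model["M"], key=lambda g: log_joint(model, g, ws))


def classify(model, seq, trials=100, seed=0):
    """Return (genus, confidence); confidence is the share of bootstrap trials
    that select the same genus as the full set of query words."""
    ws = words(seq)
    genus = best_genus(model, set(ws))
    rng = random.Random(seed)
    subset_size = max(1, len(ws) // WORD_SIZE)  # one-eighth of the words
    hits = sum(
        best_genus(model, rng.choices(ws, k=subset_size)) == genus
        for _ in range(trials)
    )
    return genus, hits / trials


# Toy training data, made up for illustration only.
toy = [
    ("GenusA", "ACGTACGTACGTAAAACCCC"),
    ("GenusA", "ACGTACGTACGTAAAACCCG"),
    ("GenusB", "TTGGCCAATTGGCCAATTGG"),
    ("GenusB", "TTGGCCAATTGGCCAATTGC"),
]
model = train(toy)
genus, confidence = classify(model, "ACGTACGTACGTAAAA")
print(genus, confidence)
```

The sketch assumes the query has at least eight bases, and it samples bootstrap words from the full list of overlapping words in the query. Summing the per-genus results up to higher ranks, as the paper does, is left out.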
Category: Research articles | 12326 views | 0 comments
Illustrated tutorial on recombination analysis and testing (Simplot)
raindyok 2014-5-20 20:11
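[Preface]

Genetic recombination is an important mechanism of viral evolution: through recombination a virus can generate a large amount of genetic variation far faster than through mutation alone. To reveal the role that recombination and similar genetic mechanisms play in the evolution of particular genes, one usually runs a recombination analysis to detect candidate recombination signals and then tests the reliability of those signals with an appropriate method. Before the New Year, at the request of several members of our QQ group (biological software exchange group: 323809849), I planned to write an illustrated tutorial using the paper 《马铃薯Y病毒福建分离物P1基因的分子变异和结构特征》 (on the molecular variation and structural characteristics of the P1 gene of Potato virus Y isolates from Fujian) and related material as the example. The paper's submission took a few detours, which delayed the tutorial; it has now been accepted by the journal 《遗传》 (Hereditas). To make up for the delay, this series covers the software used for recombination analysis one tool at a time, for the benefit of beginners.

Raindy's note: my QQ space is updated first; see http://user.qzone.qq.com/58001704/blog/1400381680

------------------------------------------------- Simplot -------------------------------------------------

Simplot is short for Similarity Plotting, that is, plotting sequence similarity. It is a very popular recombination analysis tool: according to the author's website, more than 900 publications had cited it at the time of writing, which shows how widely it is used. The software runs only on Windows, however, and has stayed at version 3.5.1 since 2003 with no further updates.

Simplot plots the similarity between the sequence under analysis (the query) and each reference sequence, and the resulting curves indicate whether the query is recombinant: (1) when the curves for the references run roughly parallel, the query is highly similar to one particular reference along its whole length and there is no recombination signal; (2) when the curves for the references cross, a recombination signal may be present. In the example plot, the P1 gene of the Fujian isolate QK44 may carry a recombination signal, because the first half of its P1 gene is very similar to the Oz isolate, while the second half has the highest sequence identity with the Mont isolate (or the N-605 isolate). (The term "similarity" applies to comparisons between different species; when the sequences come from the same species, "sequence identity" is the more accurate term.)

Raindy's note: Oz belongs to the O strain, and Mont and N-605 belong to the N strain. The N and O strains are the parents of recombinant PVY isolates, so these isolates can serve as reference sequences for PVY recombination analysis.

Procedure:

1. Group the sequences, then align them. In the screenshot, the reference sequence groups are circled in red and the P1 sequences of the isolates under analysis are marked in blue. Simplot automatically puts sequences whose names share the same first few characters (FJ in the screenshot) into one group, so give each reference group a distinctive name prefix to make the automatic grouping work. After aligning, save the alignment in *.fasta format.

2. Run Simplot. It opens on the SeqPage tab by default. Choose "File" - "Open" and select the fasta file saved in the previous step; all sequences are selected by default. Analyse the query group one sequence at a time: for group F in the screenshot, for example, start by selecting only FJ|QK44. Then switch to the "Simplot" tab and choose "Command" - "Query" - "F". The "Start Scan" button at the lower right changes from greyed out to active; click it to start the analysis. The result indicates that the QK44 isolate may carry a recombination signal. Likewise, for the LY30 isolate the Simplot result shows that its P1 gene has the highest nucleotide sequence identity with the Oz isolate (above 95%), which suggests that LY30 probably has no recombination signal.

3. Export the results. The plot drawn by Simplot can be saved directly as a BMP image or a metafile, but the resolution may fall short of journal requirements, so it is better to export the data and redraw the plot in Excel or similar software. "File" - "Save Chart Values as CSV" exports the data in CSV format. Open the file in Excel, select the relevant data, choose "Insert" - "Line chart", and polish the chart from there. Because the page setup changes when Excel converts a sheet to PDF, the x-axis labels may be cut off, so you can also redraw the figure in R. A reference script follows:

    # Expects test.csv with columns x (position) and y1, y2, y3 (one curve per reference).
    setwd("D:\\R-out")
    A <- read.csv("test.csv")
    par(mar = c(8, 4.5, 6, 1))
    # First reference curve; axes are suppressed here and drawn by axis() below.
    plot(x = A$x, y = A$y1, xaxt = "n", yaxt = "n", xlim = c(101, 9101), ylim = c(0, 1),
         xlab = "Nucleotide position", ylab = "Sequence identity",
         col = "#FF0000", pch = 16, cex = 0.2, mgp = c(3, 1, 0))
    lines(A$x, A$y1, lty = 1, lwd = 2, col = "#FF0000")
    # Second reference curve, overlaid on the same plot region.
    par(new = TRUE)
    plot(x = A$x, y = A$y2, xaxt = "n", yaxt = "n", xlim = c(101, 9101), ylim = c(0, 1),
         xlab = "", ylab = "", col = "#00B050", pch = 16, cex = 0.2)
    lines(A$x, A$y2, lty = 1, lwd = 2, col = "#00B050")
    # Third reference curve.
    par(new = TRUE)
    plot(x = A$x, y = A$y3, xaxt = "n", yaxt = "n", xlim = c(101, 9101), ylim = c(0, 1),
         xlab = "", ylab = "", col = "#00B0F0", pch = 16, cex = 0.2)
    lines(A$x, A$y3, lty = 1, lwd = 2, col = "#00B0F0")
    # Arrows and labels marking the two junctions RJ1 and RJ2.
    arrows(400, 0.83, 700, 0.85, code = 1, length = 0.1, col = "black", lty = 1, lwd = 3)
    arrows(1800, 0.89, 2100, 0.91, code = 2, length = 0.1, col = "black", lty = 1, lwd = 3)
    axis(1, seq(101, 9101, by = 300), las = 2, tck = 0.01)
    axis(2, seq(0, 1, by = 0.1), las = 1, tck = 0.01)
    text(960, 0.87, "RJ1")
    text(1755, 0.87, "RJ2")
    legend(2900, 0.5, bty = "n",
           legend = c("Mont|AY884983", "Oz|EF026074", "Wilga5|AJ890350"),
           col = c("#FF0000", "#00B050", "#00B0F0"), lty = 1, lwd = 2, y.intersp = 1.5)
    # savePlot() with type = "pdf" works on the Windows graphics device.
    savePlot("plot", type = "pdf", device = dev.cur(), restoreConsole = TRUE)

Raindy's note: the values marked in red in the script can be adjusted to fit your data. Special thanks to @刘阳 for providing the script.

Example article: 《马铃薯Y病毒福建分离物P1基因的分子变异和结构特征》: http://www.chinagene.cn/Jwk_yc/CN/abstract/abstract21219.shtml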
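The idea behind the similarity plot (parallel curves versus crossing curves) can be illustrated with a minimal Python sketch. This is not Simplot's implementation: it uses plain percent identity over sliding windows of an existing alignment, skips columns where either sequence has a gap, and the function names `window_identity` and `closest_reference`, the default window and step sizes, and the toy sequences are all assumptions made for illustration.

```python
def window_identity(query, reference, window=200, step=20):
    """Identity between two aligned sequences in sliding windows.

    Returns a list of (window_centre, identity) points. Columns where either
    sequence has a gap ("-") are skipped (an assumption of this sketch).
    """
    if len(query) != len(reference):
        raise ValueError("sequences must come from the same alignment")
    points = []
    for start in range(0, len(query) - window + 1, step):
        columns = [
            (q, r)
            for q, r in zip(query[start:start + window],
                            reference[start:start + window])
            if q != "-" and r != "-"
        ]
        matches = sum(q == r for q, r in columns)
        identity = matches / len(columns) if columns else 0.0
        points.append((start + window // 2, identity))
    return points


def closest_reference(query, references, window=200, step=20):
    """For each window, name the reference with the highest identity.

    A change of closest reference along the sequence corresponds to crossing
    curves in the plot, i.e. a possible recombination signal.
    """
    curves = {name: window_identity(query, ref, window, step)
              for name, ref in references.items()}
    names = list(curves)
    result = []
    for i in range(len(curves[names[0]])):
        best = max(names, key=lambda name: curves[name][i][1])
        result.append((curves[best][i][0], best))
    return result


# Toy aligned sequences, made up for illustration (not real PVY data).
query = "AAAAAAAAAACCCCCCCCCC"
references = {
    "Oz": "AAAAAAAAAAGGGGGGGGGG",
    "Mont": "TTTTTTTTTTCCCCCCCCCC",
}
for centre, name in closest_reference(query, references, window=10, step=10):
    print(centre, name)
```

In the toy data the closest reference switches from one parent to the other halfway along the query, which is the crossing pattern described above. A switch like this is only a candidate signal and still needs the reliability testing mentioned in the preface.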
Category: Software tutorials | 44339 views | 0 comments

Copyright © 2007- China Science Daily (中国科学报社)
