ppprotein's personal blog http://blog.sciencenet.cn/u/ppprotein


RPS-Blast installation and usage

Posted 2013-9-16 08:53 | Category: bioinformatics | Site category: research notes | Keywords: scholar

RPS-Blast is similar to HMMER: it searches the CDD database to find out whether a protein contains a given domain or belongs to a given family.


Download cdd.tar.gz from ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd and unpack it.
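
The download and unpack step can be sketched as below. This is a minimal sketch, assuming `curl` and `tar` are installed; the helper name `unpack_cdd` and the target directory `cdd_db` are illustrative and not part of the NCBI distribution.

```shell
# Hypothetical helper: unpack a CDD archive into a target directory.
# Usage: unpack_cdd cdd.tar.gz cdd_db
unpack_cdd() {
    archive=$1
    dest=$2
    mkdir -p "$dest" || return 1
    # -x extract, -z gunzip, -f archive file, -C switch to the target directory
    tar -xzf "$archive" -C "$dest"
}

# Example (not run automatically, needs network access):
#   curl -O ftp://ftp.ncbi.nlm.nih.gov/pub/mmdb/cdd/cdd.tar.gz
#   unpack_cdd cdd.tar.gz cdd_db
```

Unpacking into a separate directory keeps the many *.smp files out of the working directory.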


Format the database:
makeprofiledb -title Pfam.v.xxx -in Pfam.pn -out Pfam -threshold 9.82 -scale 100.0 -dbtype rps -index true

Run the search:

rpsblast -i ceramide_glucosyltransferase.fa -d /data1/RefGenome/Pfam/Pfam -F T -e 0.01 -m 9
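
The command above uses the option names of the legacy rpsblast binary. The rpsblast that ships with BLAST+ (the same package that provides makeprofiledb) spells its options differently. Below is a sketch of what should be the equivalent BLAST+ call, under the assumption that the options map one to one; confirm with `rpsblast -help` for your version. The wrapper name `run_rpsblast_plus` is made up for illustration.

```shell
# Hypothetical wrapper around the BLAST+ rpsblast program.
# Usage: run_rpsblast_plus query.fa /path/to/db_basename
run_rpsblast_plus() {
    # -query    replaces -i    (FASTA protein input)
    # -db       replaces -d    (database base name, without extension)
    # -seg yes  replaces -F T  (low-complexity filtering of the query)
    # -evalue   replaces -e    (E-value cutoff)
    # -outfmt 7 replaces -m 9  (tabular output with comment lines)
    rpsblast -query "$1" -db "$2" -seg yes -evalue 0.01 -outfmt 7
}
```

For the search above this would be called as `run_rpsblast_plus ceramide_glucosyltransferase.fa /data1/RefGenome/Pfam/Pfam`.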

Note 1: rpsblast takes a FASTA-format protein sequence file as input.

Note 2: do not use formatrpsdb to format the database; it produces an error.
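
The `-m 9` option in the search command requests tabular output, where comment lines start with `#` and each hit is one tab-separated line. A small filter for that output might look like the sketch below. It assumes the usual 12-column BLAST tabular layout (query, subject, % identity, alignment length, mismatches, gap opens, q.start, q.end, s.start, s.end, E-value, bit score); `summarize_hits` is a hypothetical helper name.

```shell
# Hypothetical helper: print query, subject, E-value and bit score per hit.
# Reads tabular BLAST output on stdin.
summarize_hits() {
    awk -F '\t' '
        /^#/     { next }                                  # skip comment lines
        NF >= 12 { print $1 "\t" $2 "\t" $11 "\t" $12 }    # keep four columns
    '
}
```

Usage: append `| summarize_hits` to the rpsblast command.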



The README in NCBI's CDD FTP directory has detailed instructions.

Appendix:
===============================================================================
cdd.tar.gz
===============================================================================

"cdd.tar.gz" is a gzipped archive file that contains Position-Specific
Scoring Matrices (PSSMs) originating from all of the alignment collections
encompassed by the Conserved Domain database project.

       (Scope A: this file includes data from ALL CD models;
       see section on "SCOPE OF DATA in FTP FILES" for details)

To build search databases for RPS-Blast you need to unpack the
archive and extract its contents. It contains ascii formatted
files only, with the following extensions:

*.smp ...... Position Specific Scoring Matrices (PSSMs). These are
             stored in a new ASN.1 format ("scoremat"), which is shared
             between various BLAST applications.
*.pn ....... lists of PSSM file names
 
and allows for the compilation of 6 RPS-Blast search databases

Smart
Pfam
Cog
Kog
Prk
Cdd  (domains from Smart, Pfam, COG, PRK, and cd,
      this is the set that's indexed in NCBI's Entrez)
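
Since a *.pn file is only a list of PSSM file names, it is easy to check that the archive unpacked completely before formatting. A minimal sketch, assuming one file name per line with paths relative to the directory that holds the .pn file; `check_pn` is a hypothetical helper.

```shell
# Hypothetical helper: report .smp files listed in a .pn file but missing on disk.
# Usage: check_pn Pfam.pn   (returns non-zero if anything is missing)
check_pn() {
    pn=$1
    dir=$(dirname "$pn")
    missing=0
    # The "|| [ -n ... ]" part also handles a last line without a trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -n "$name" ] || continue          # ignore blank lines
        if [ ! -f "$dir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done < "$pn"
    return "$missing"
}
```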
 
The databases must be formatted with the "makeprofiledb" application
that is distributed with the BLAST executables
(ftp://ftp.ncbi.nih.gov/blast/executables/).  
Be sure to use recent BLAST executables
in order to obtain the makeprofiledb application that is compatible
with the CDD FTP files. (The formatrpsdb application packaged
with earlier BLAST releases is not compatible and will result in
an error message, "unable to match element in intermediateData...
error no data found in file.")

The following sequence of commands will build the search databases:

 
makeprofiledb -title SMART.v6.0 -in Smart.pn -out Smart -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title Pfam.v.26.0 -in Pfam.pn -out Pfam -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title COG.v.1.0 -in Cog.pn -out Cog -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title KOG.v.1.0 -in Kog.pn -out Kog -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title CDD.v.3.10 -in Cdd.pn -out Cdd -threshold 9.82 -scale 100.0 -dbtype rps -index true

makeprofiledb -title PRK.v.6.00 -in Prk.pn -out Prk -threshold 9.82 -scale 100.0 -dbtype rps -index true
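
The six commands above differ only in title, list file and output name, so they can be driven from a small table. A sketch, using the titles exactly as given above; `build_all_rps` is a hypothetical helper and is meant to be run in the directory that contains the unpacked .pn and .smp files.

```shell
# Hypothetical helper: format all six RPS-Blast databases in one pass.
build_all_rps() {
    # Each row of the table: database base name, then title.
    while read -r name title; do
        # </dev/null keeps makeprofiledb from consuming the table on stdin.
        makeprofiledb -title "$title" -in "$name.pn" -out "$name" \
            -threshold 9.82 -scale 100.0 -dbtype rps -index true \
            < /dev/null || return 1
    done <<'EOF'
Smart SMART.v6.0
Pfam Pfam.v.26.0
Cog COG.v.1.0
Kog KOG.v.1.0
Cdd CDD.v.3.10
Prk PRK.v.6.00
EOF
}
```

The loop stops at the first database that fails to build, so a broken list file is noticed early.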


Note that the parameter '-threshold' supplied with makeprofiledb, the three-letter
word score threshold for detecting and extending hits in RPS-Blast, will
determine the size of the search database. A lower threshold
will result in larger databases and slightly increased search sensitivity,
at the cost of additional memory requirements and reduced search speed.
Matrices distributed for creating RPS-Blast search databases are scaled by a
factor of 100 (parameter -scale). A score threshold value of 9.82 will result in
search-databases of a size very similar to using unscaled matrices and
a threshold value of 11.

Note also that the RPS-Blast search databases generated by makeprofiledb
are architecture dependent; it may not be possible to create them on one
platform and use them on another.

When searching with your local version of RPS-blast, use the command-line
argument "-d" to specify the database name and location. You need an
executable version of the "rpsblast" program; type "rpsblast" without
arguments to obtain a list of command-line options.
 
You can now take any arbitrary subset of PSSMs and compile them into an
RPS-Blast search database. All that makeprofiledb needs is a list of file
names (such as "Smart.pn" in the example above) and the corresponding
"scoremats" (*.smp) files. Newer versions of Psi-BLAST (blastpgp) can now
write out "checkpoints" in the "scoremat" format as well (blastpgp parameter
-u1). These again can be combined with arbitrary subsets of scoremat-
formatted PSSMs distributed here, to create customized RPS-Blast search sets.
The scoremat-formatted PSSMs distributed here are scaled with a factor 100.0,
and if one were to combine them with Psi-BLAST generated "scoremats", the
same scaling factor must be set as a parameter with makeprofiledb.
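
Building a custom search set therefore comes down to writing a new list file. A minimal sketch, assuming the list file holds one .smp file name per line (for example pfam00001.smp, which is an assumption about the naming) and that the wanted models are given as exact file names in a text file; `make_subset_pn`, `wanted.txt` and `Custom` are illustrative names.

```shell
# Hypothetical helper: keep only the entries of a .pn file that also
# appear in a list of wanted file names (one per line, exact match).
# Usage: make_subset_pn Pfam.pn wanted.txt > Custom.pn
make_subset_pn() {
    # -F fixed strings, -x whole-line match, -f read patterns from a file
    grep -F -x -f "$2" "$1"
}

# Then format the subset with the same scale as the distributed PSSMs:
#   makeprofiledb -title Custom -in Custom.pn -out Custom \
#       -threshold 9.82 -scale 100.0 -dbtype rps -index true
```

Whole-line matching avoids picking up pfam00010.smp when only pfam0001 was meant.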

Note: If you prefer to use preformatted databases, see the big_endian and
little_endian subdirectories of the CDD FTP site. They contain databases
that have been preformatted for use with various architecture/OS combinations
(Intel, Sun, SGI / Linux, Windows, Solaris, IRIX).
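
To pick between the two subdirectories you need to know the byte order of your machine. One way to check from the shell is sketched below, assuming a POSIX `od` is available; `host_endianness` is a hypothetical helper, and whether the FTP site still carries both subdirectories has not been verified here.

```shell
# Hypothetical helper: print "little_endian" or "big_endian" for this machine.
host_endianness() {
    # Write the bytes 0x01 0x00 and let od read them as one unsigned 16-bit
    # integer: a little-endian host reads 1, a big-endian host reads 256.
    value=$(printf '\001\000' | od -An -tu2 | tr -d '[:space:]')
    case $value in
        1)   echo little_endian ;;
        256) echo big_endian ;;
        *)   echo unknown; return 1 ;;
    esac
}
```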



https://m.sciencenet.cn/blog-981687-725295.html

