科学网


Tag: RNA-seq


Related blog posts

Ensembl, UCSC, RefSeq: which one should you use?
ChengyangWang 2017-12-15 14:42
This article is reposted with permission from the 嘉因 WeChat public account. For the latest articles, follow 嘉因 (WeChat ID: rainbow-genome). Author: 小哈. Source: 嘉因.

Everyone can make instant noodles. Some make Shin Ramyun, some make three-delicacy noodles; how do the processes differ? Everyone can run RNA-seq. Some screen out meaningful genes, some find valuable leads, and some... Where does the difference come from?

The previous four installments covered sound approaches to normalization, differential gene screening, heatmap drawing, and enrichment analysis:
Part 1, data preprocessing: Same RNA-seq data set, so why do the company's results differ from my senior labmate's? | TPM, read counts, RPKM/FPKM: did you pick the right one?
Part 2, differential gene screening: Same RNA-seq data set, so why do the company's differential genes differ from my senior labmate's? | P-value, FDR, cutoff
Part 3, heatmap: A badly drawn heatmap leads to wrong conclusions | data preprocessing, clustering, and the fine points of HCL and K-means
Part 4, enrichment analysis: Two people run enrichment analysis and their results are 5 years apart | How old is your annotation file?

小哈 told us to compute read counts, but why do my read counts still differ from the company's? This installment steps back to look at how the gene model chosen at the mapping step affects the results.

Once you have sequencing data, the first step is to map the reads back to the genome. That requires the genome sequence as a FASTA file, plus a description of where the genes sit on the genome, i.e. the gene model, which is stored in a GTF file.

For human or mouse data, use GENCODE. So how do the well-known Ensembl, UCSC, and RefSeq relate to GENCODE, and which one should be used for other species? Ensembl states that the current GENCODE release equals Ensembl (www.ensembl.org/Help/Faq?id=303). GENCODE itself points out that there are still some differences from Ensembl and that the GTF files differ slightly (www.gencodegenes.org/faq.html).

How much does the choice among Ensembl, UCSC, and RefSeq affect the results? Someone has made a dedicated comparison: first an overall evaluation of how the three gene models affect mapping, then a closer look at the effect on specific genes.

Conclusions first:
1. The gene model affects gene expression estimates and even the selection of differentially expressed genes, especially where gene models disagree on a gene's length or junction sites.
2. The Ensembl annotation is relatively more accurate and contains more genes.
3. Recommendation: use GENCODE for human and mouse (it comes from ENCODE, the most authoritative source) and Ensembl for other species.

The results in the paper, one by one:

The read mapping summary in the "transcriptome only" and "transcriptome + genome" mapping modes: more reads are mapped in Ensembl than in RefGene and UCSC in the "transcriptome only" mode, and more reads become multiple-mapped in Ensembl than in RefGene and UCSC. RefGene and UCSC consistently have the highest percentage of uniquely mapped reads, while the percentage of non-uniquely mapped reads is much higher in Ensembl. Without a gene model in the mapping step (indicated in pink in the original figure), a constant 6% of reads become unmapped.

The authors divided uniquely mapped reads into two classes, non-junction reads and junction reads, and investigated the impact of a gene model on their mapping. The impact of a gene model on the mapping of non-junction reads is different from its impact on junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene model was used. By contrast, this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10–15% mapped alternatively.

The overlap and intersection among RefGene, UCSC, and Ensembl annotations: in general, different annotations have very high overlaps; there are 21,598 common genes shared by all three gene models. RefGene has the fewest unique genes, while more than 50% of genes in Ensembl are unique.

The correlation of gene quantification results between RefGene and Ensembl: although the majority of genes have highly consistent or nearly identical expression levels, there are many genes whose quantification results are dramatically affected by the choice of a gene model.

Looking at read counts gene by gene, the counts computed with Ensembl and with RefGene can differ greatly. Why? Two genes serve as examples.

The different gene definitions for PIK3CA give rise to differences in gene quantification. PIK3CA in the Ensembl annotation is much longer than its definition in RefGene, explaining why there are 1094 reads mapped to PIK3CA in Ensembl, while only 492 reads are mapped in RefGene. The PIK3CA gene definition in Ensembl seems more accurate than the one in RefGene, based upon the mapping profile of sequence reads.

The different gene definitions for LUZP6. In the Ensembl annotation, LUZP6 is only 177 bp long, and it is completely within another gene, MTPN. As a result, all sequence reads originating from LUZP6 are assigned to MTPN instead. In RefGene, LUZP6 and MTPN are derived from the same genomic region, and both encode exactly the same mRNA, though the protein coding sequences are different. Therefore, all reads mapped to this region are equally distributed between these two genes.

The correlation of the calculated Log2Ratio (heart/liver) between RefGene and Ensembl: although the majority of genes have highly consistent expression changes, there are many genes that are remarkably affected by the choice of different gene models.
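The PIK3CA example turns on the fact that two annotations can give the same gene different exon coordinates and hence different lengths. As a rough illustration, the sketch below computes the merged exon length per gene from GTF text. The two GTF snippets, the gene name GENE_A, and the function name are made up for this example and do not come from the paper or from any real annotation release.

```python
from collections import defaultdict


def exon_union_length(gtf_text):
    """Return {gene_id: length of the union of its exon intervals}.

    GTF coordinates are 1-based and inclusive, so an exon spanning
    100..199 contributes 100 bases.
    """
    exons = defaultdict(list)
    for line in gtf_text.strip().splitlines():
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        start, end = int(fields[3]), int(fields[4])
        # Attribute column looks like: gene_id "X"; transcript_id "Y";
        attrs = dict(
            item.strip().split(" ", 1)
            for item in fields[8].strip().strip(";").split(";")
            if item.strip()
        )
        exons[attrs["gene_id"].strip('"')].append((start, end))

    lengths = {}
    for gene_id, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:      # overlapping or abutting exon
                cur_end = max(cur_end, end)
            else:                          # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


# Hypothetical annotations of the same gene in two gene models.
MODEL_A = (
    'chr1\tmodelA\texon\t100\t199\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T1";\n'
    'chr1\tmodelA\texon\t300\t399\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T1";\n'
)
MODEL_B = MODEL_A + (
    'chr1\tmodelB\texon\t500\t899\t.\t+\t.\tgene_id "GENE_A"; transcript_id "T2";\n'
)

lengths_a = exon_union_length(MODEL_A)
lengths_b = exon_union_length(MODEL_B)
print(lengths_a["GENE_A"], lengths_b["GENE_A"])
```

A gene that is three times longer in one model will, all else being equal, collect more reads under that model, which is the kind of discrepancy the post describes.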
Category: RNA-seq | 13524 views | 0 comments
The heartbreaking p-value = 0.051 | How many replicates does it take to get a lower p-value?
ChengyangWang 2017-12-15 14:12
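This article is reposted with permission from the 嘉因 WeChat public account. For the latest articles, follow 嘉因 (WeChat ID: rainbow-genome). Author: 小哈. Source: 嘉因.

A pass mark is all you need; one extra point is a waste. p-value = 0.051...

Scenario 1: For RNA-seq, how many replicates do people run? How many should they run? How many do journals accept?
Scenario 2: RNA-seq was done with 3 replicates. Screening at p-value < 0.05 gives too few differential genes; screening with only 2 of the replicates gives plenty. Hooray!
Scenario 3: A gene KO group is compared with a control group to test whether the phenotype after drug treatment differs significantly. After 3 experiments, p-value > 0.05; keep going to 5 replicates, p-value < 0.05. Hooray!

Fewer replicates? More replicates? What is reasonable?

Let's see what the StatQuest video on power in statistics says about how many replicates it takes to get a lower p-value.

You are not the only unlucky one. What happens when you add replicates? After adding replicates, the p-value drops below 0.05 in 30% of cases. Does that make sense? No! Those are false positives. You must not add replicates in order to get a good p-value; doing so increases false positives.

The correct approach:
1. Before the experiment, work out how large a sample is actually needed.
2. If the sample size was not assessed beforehand, assess it and then start the experiment over.
(You statisticians do not understand the sorrow of bench scientists.)

How do you assess how many replicates are needed? The pre-experiment assessment, the power calculation, depends on 4 factors. The fourth is the statistical method; the most widely used one, the t-test, has the strongest power. Focus on the first three factors:

Ideally you have preliminary data or published data; a sixth sense will also do:

If you have neither preliminary data nor a sixth sense, there is still a way:

For RNA-seq data:

How exactly is it computed? Let the computer do it, with G*Power. It was just updated (17 July 2017, Release 3.1.9.3) and has Windows and Mac versions. It is simple to use:

You may need some background; here is a clear and concise StatQuest video.

Another video covers one- versus two-tailed t-tests.

Back to the opening scenarios, the solution: never mind the calculation, "We always use 3 patients..."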
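To make the idea of a pre-experiment power calculation concrete, here is a minimal sketch using the textbook normal approximation for a two-sided, two-group comparison: n per group ≈ 2 × ((z for alpha/2 + z for power) / d)², where d is the expected difference divided by the standard deviation. This is not the computation G*Power performs (G*Power uses the t distribution and gives slightly larger numbers for small samples); the function name and default values are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def replicates_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate replicates per group for a two-sided two-sample test.

    effect_size is the expected mean difference divided by the standard
    deviation (Cohen's d). Normal approximation, so treat the result as
    a lower bound for small samples.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)


for d in (0.5, 1.0, 2.0):
    print(d, replicates_per_group(d))
```

The point matches the post: the number of replicates is fixed in advance from the expected effect size, the variability, alpha, and the desired power, rather than increased until the p-value looks good.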
Category: RNA-seq | 5572 views | 0 comments
[Repost] How many biological replicates does RNA-seq need? 100% of the genes found are true responsive genes
ChengyangWang 2017-12-15 10:44
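This article is reposted with permission from the 嘉因 WeChat public account. For the latest articles, follow 嘉因 (WeChat ID: rainbow-genome). Author: 小丫. Source: 嘉因.

Junior labmate Yin: "Senior labmate Ha, validation experiments are exhausting. The boss wants me to tighten the criteria for selecting differential genes and keep false positives as low as possible. How should I screen?"

小哈 opened Evernote and showed her a table: "See that 100%? With 30 million mapped reads, 10 replicates, and a 2-fold cutoff, the statistical power is 100%, and everything found is a true responsive gene. With only 3 replicates and a 2-fold cutoff, you reach 87%. If the sequencing depth drops to 15 million mapped reads, you need 10 replicates to reach 85%."

Yin: "My samples have 30 M mapped reads and 3 replicates. With a 2-fold cutoff that is 87% statistical power, which I find acceptable."

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

For other articles in the series "The happy life of a bioinformatics master's student who wandered into a wet lab", follow 嘉因生物 and reply with 小哈 plus the article number, for example "小哈1".
小哈1. The opening of junior labmate Ha's PhD journey
小哈2. How to check lncRNA-disease relationships in batch
小哈3. How to avoid unreliable results caused by batch effects
小哈4. What happens without a control
小哈5. How to design a sequencing experiment for a familial genetic disease
小哈6. Determining dominant, recessive, and sex-linked inheritance of genetic diseases
小哈7. Jane helps you choose journals and reviewers
小哈8. Searching papers directly by impact factor with Gnosis
小哈9. What do histone modifications indicate?
小哈10. How long after drug treatment do changes in histone modifications appear?
小哈11. How SNPs on lncRNAs affect their mechanism of action
小哈12. How much data should be sequenced? Converting between read numbers and gigabases
小哈13. RT-qPCR validation: which segment of which lncRNA should the primers target?
小哈14. Bioinformatics tools for studying a single gene (comprehensive list)
小哈15. RT-qPCR validation of miRNA: tested by colleagues, efficient, cheap, species-independent (with reagent catalog numbers)
小哈16. How many biological replicates does RNA-seq need? 100% of the genes found are true responsive genes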
Category: RNA-seq | 2256 views | 0 comments
Using TGICL and interpreting its results
Popularity: 1 luria 2017-6-1 15:19
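Trinity is extremely powerful, but its assemblies are still too fragmented, and its memory footprint is a serious drawback. If you have a high-end machine that can run jobs needing several hundred GB of memory, pool the cleaned paired-end reads of all samples (R1 with R1, R2 with R2) and run a single paired-end Trinity assembly. Those without such a machine usually assemble each sample separately, then reassemble the contigs by some other means to produce a Unigene set shared by all samples. That gives one common reference for all samples, against which reads are aligned to compute differential expression.

For this contig reassembly step I recommend a veteran clustering and assembly package, TGICL (TGI Clustering tools). It first aligns the input FASTA sequences with megablast and groups them into clusters, then assembles each cluster with CAP3. Because the clusters are independent and do not overlap, the program offers a multi-threading option, which compensates for the slowness of the CAP3 step.

1. Download and installation

Download the latest TGICL from https://sourceforge.net/projects/tgicl/files/?source=navbar (v2.1 is used here). I suggest downloading the source archive TGICL-2.1.tar.gz; installing from the rpm/deb binaries seemed to raise errors, which I did not investigate. Interested readers are welcome to try the rpm/deb packages and share the process under this post.

    tar -zxvf TGICL-2.1.tar.gz
    cd TGICL-2.1
    perl Build.PL
    ./Build
    ./Build test
    ./Build install
    # If any step reports "permission denied", prefix the command with sudo

If typing

    perldoc -F /usr/local/bin/tgicl

in the terminal brings up the TGICL help page, the installation is complete.

2. Using TGICL

Choose a working directory, put the FASTA file to be de-duplicated and assembled there (for example trinity_gt500.fasta, the sequences longer than 500 bp from a Trinity assembly), and run

    tgicl -F trinity_gt500.fasta -c 3

The following message may appear and can be ignored:

    Use of :locked is deprecated at /usr/local/share/perl/5.18.2/TGI/DBDrv.pm line 36.

After the run, err_tgicl_trinity_gt500.fasta.log and tgicl_trinity_gt500.fasta.log hold the standard error and standard output respectively; if the run had no errors, the two should be identical. They record information about the three stages: indexing, clustering, and assembly.

Now the key points.

Q1: How are the four files in the working directory related: trinity_gt500.fasta, trinity_gt500.fasta_cl_clusters, trinity_gt500.fasta.singletons, and masked.lst?
For clarity I drew one Venn diagram of the sequences in the last three files and another of the sequences in all four. The results show:
A. trinity_gt500.fasta_cl_clusters and trinity_gt500.fasta.singletons share sequences, surprisingly.
B. The union of trinity_gt500.fasta_cl_clusters and trinity_gt500.fasta.singletons is exactly the full set of sequences in trinity_gt500.fasta.
C. masked.lst contains sequences from both trinity_gt500.fasta_cl_clusters and trinity_gt500.fasta.singletons.

Q2: How do the files in the asm directories relate to the clusters?
Because -c requested 3 CPUs, three directories are created: asm_1, asm_2, and asm_3. The CL entries in their log_std files do not repeat across directories, and together they add up exactly to the contents of trinity_gt500.fasta_cl_clusters. So the program first distributes the clusters in trinity_gt500.fasta_cl_clusters over N CPUs (in N directories). CAP3 then assembles the clusters listed in each directory's log_std. Not every sequence in a cluster ends up in a contig, and the sequences left unused by the assembly are written to the singlets file in each asm directory.

Q3: Why do trinity_gt500.fasta_cl_clusters and trinity_gt500.fasta.singletons overlap in Q1, and what is the overlap?
Bingo! As you have probably guessed, the overlap is the union of the sequences in the singlets files of all asm directories. To test this, I first checked whether the singlets files of the asm directories overlap with one another; they do not, as expected. I then checked whether the union of the sequences in all singlets files equals the intersection of trinity_gt500.fasta.singletons and trinity_gt500.fasta_cl_clusters. It does, which confirms the guess.

Putting it all together, you only need to pull out the sequences listed in trinity_gt500.fasta.singletons, then merge the contigs from every asm directory and give them new Unigene IDs.

    vim collect_tgicl_result.py

Copy the following code into collect_tgicl_result.py (it requires Biopython):

    #!/usr/bin/env python3
    """collect_tgicl_result.py: merge TGICL contigs and singletons into one FASTA."""
    import os
    import sys

    from Bio import SeqIO


    def main(singleton, fasta, *asm_dirs):
        unigene = 1
        with open('tgicl_result.fasta', 'w') as result:
            # Contigs assembled by CAP3 in each asm directory get new Unigene IDs.
            for asm_dir in asm_dirs:
                for rec in SeqIO.parse(os.path.join(asm_dir, 'contigs'), 'fasta'):
                    result.write('>cluster_contig%s\n%s\n' % (unigene, rec.seq))
                    unigene += 1
            # Singletons keep their original IDs; sequences come from the input FASTA.
            fa_dic = {fa.id: fa.seq for fa in SeqIO.parse(fasta, 'fasta')}
            with open(singleton) as handle:
                for line in handle:
                    seq_id = line.strip()
                    if seq_id:
                        result.write('>%s\n%s\n' % (seq_id, fa_dic[seq_id]))


    if __name__ == '__main__':
        if len(sys.argv) < 4:
            print(' collect TGICL result\n'
                  ' python %s singletons fasta asm*\n'
                  ' singletons is located in workspace\n'
                  ' fasta is fasta file which was used to be clustered and assembled\n'
                  ' asm* are asm directories, `contigs\' file must be in those dirs\n'
                  % sys.argv[0])
            sys.exit(1)
        main(*sys.argv[1:])

Run it in the working directory:

    python collect_tgicl_result.py trinity_gt500.fasta.singletons trinity_gt500.fasta asm_*

This produces tgicl_result.fasta, the final Unigene set after TGICL clustering and assembly.

References
[1] Geo Pertea, Xiaoqiu Huang, Feng Liang, et al. TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19(5):651-652
[2] http://bioinformation.cn/?p=563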
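The relationships worked out in Q1 to Q3 are plain set identities, so they can be checked with a few lines of Python once the sequence IDs of each file are loaded into sets. The sketch below uses made-up IDs instead of parsing real TGICL output (the parsing depends on file layouts not spelled out in this post), and the function name is hypothetical.

```python
def check_relationships(all_ids, clusters, singletons, asm_singlets):
    """Check the three observations from the post on sets of sequence IDs."""
    pooled = set().union(*asm_singlets)
    return {
        # B: clusters and singletons together cover the whole input
        "union_is_input": clusters | singletons == all_ids,
        # Q3: the overlap is exactly what CAP3 left unassembled
        "overlap_is_asm_singlets": clusters & singletons == pooled,
        # singlets files of different asm directories do not overlap
        "asm_singlets_disjoint": sum(len(s) for s in asm_singlets) == len(pooled),
    }


# Toy IDs: s1..s4 were clustered, s5 and s6 never joined a cluster,
# and CAP3 could not place s3 (in asm_1) or s4 (in asm_2) into a contig.
all_ids = {"s1", "s2", "s3", "s4", "s5", "s6"}
clusters = {"s1", "s2", "s3", "s4"}
asm_singlets = [{"s3"}, {"s4"}]
singletons = {"s5", "s6"} | set().union(*asm_singlets)

report = check_relationships(all_ids, clusters, singletons, asm_singlets)
print(report)
```

With real data, any False in the report would suggest that the run was incomplete or that the files were read incorrectly.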
Category: RNA-seq | 10942 views | 2 comments
Thoughts prompted by strand-specific RNA-seq
liujd 2017-4-18 19:43
Category: Bioinformatics | 9 views | 0 comments
RNA-seq differential analysis, even with zero background
LLina 2017-3-3 12:33
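Differential analysis of gene expression profiles is the most common application of RNA-seq. To someone who cannot program, does not know statistics, and comes from a pure biology background, RNA-seq differential analysis may look like a cry for help. To others it looks quite different: with a cloud platform and a graphical interface, people with no bioinformatics background can run RNA-seq differential analysis too. This was my first attempt as well, so let's see how a cloud platform comes to the rescue.

The platform is GCBI. I have previously described using GCBI for DNA sequencing and microarray analysis; see that guide if you have not used it. Raw RNA sequencing data are too large, so I used the demo data directly.

1. Create a new RNA sequencing project (see the guide above if you do not know how). This opens the analysis interface.

2. An analysis needs samples first. Click "Add sample data" to import samples. If you use your own sequencing data, see the data upload instructions, which are not covered here. After import, the quality-control results can be viewed under "Data table" and "Result plots". Next, group the imported data: click "Add new group" to create groups. I created two, EG (experimental group) and CG (control group). Once grouped, select the groups to compare and proceed to differential analysis.

3. Choosing differential analysis parameters. The main parameters are the P value, the Q value, and the fold change. You can keep the defaults or set your own. I used the defaults here.

4. Results. Click OK and you are done. The demo data are five pairs of rectal cancer and adjacent tissue samples. Under default parameters, 395 differential genes were selected, 137 up-regulated and 258 down-regulated. Below the results, some tag-based classifications are provided, such as the number of tumor-related genes found in COSMIC. Click the results table to see the expression values of each gene.

With so many genes, how do you quickly find the ones of interest? Use the filter above the table. You can filter by disease or define your own filter; the custom options are finely categorized and quite targeted. A data filtering guide gives some direction.

That completes the RNA-seq differential analysis. This workflow uses the HISAT2, StringTie, Ballgown combination. What is its advantage over the traditional TopHat and Cufflinks workflow? In one sentence: TopHat was first published 7 years ago, and Cufflinks 6 years ago.

RNA-seq differential analysis site: https://www.gcbi.com.cn/gclab/html/index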
Category: Bioinformatics analysis | 19568 views | 0 comments
[Repost] Measures of gene expression: RPKM vs FPKM
lemoncyb 2016-11-11 03:18
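RNA-seq measures gene expression using NGS technology. Using only the number of reads aligned to a reference gene (usually called the count) as the measure of expression is statistically unsound. This post introduces two measures of gene expression, RPKM and FPKM.

Under random sampling, a longer gene is more likely to be sampled than a shorter one. Long genes would therefore always appear more highly expressed, and the true expression level would be misestimated. Likewise, when sequencing depth differs, a more deeply sequenced sample has more reads aligned to every gene.

RPKM (Reads Per Kilobase per Million) and FPKM (Fragments Per Kilobase per Million) were devised to remove the effects of gene length and sequencing depth.

RPKM and FPKM rest on the same principle. The difference is that FPKM counts DNA fragments: in Illumina paired-end RNA-seq, a pair of (two) reads corresponds to one DNA fragment. With FPKM (or RPKM) we can compare the relative expression of gene A and gene B within one sample, or the relative expression of the same gene across samples.

The reasoning is as follows. "Per kilobase" is introduced because RNAs differ in length, and the longer the RNA, the more reads it yields. Once each RNA's count is divided by its own length (in units of 1,000 bases), different genes within one sample become comparable. Similarly, "per million reads" is introduced because samples may be sequenced to different depths, and a deeper sample yields more reads. Dividing by each library's size (in units of one million reads) makes the same gene comparable across samples.

RPKM
RPKM = reads mapped to the gene / (total reads mapped to the genome, in millions × RNA length, in kb).

FPKM
FPKM = fragments mapped to the gene / (total fragments mapped to the genome, in millions × RNA length, in kb).

As the formulas show, both methods normalize the read (fragment) count for sequencing depth (in millions) and for gene length (in kb), removing the influence of differing depth and gene length on the expression estimate.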
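The two definitions above translate directly into code. The following is a minimal sketch with made-up numbers; the function names and the example counts are illustrative only.

```python
def rpkm(reads_on_gene, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    length_kb = gene_length_bp / 1_000
    depth_millions = total_mapped_reads / 1_000_000
    return reads_on_gene / (length_kb * depth_millions)


def fpkm(fragments_on_gene, gene_length_bp, total_mapped_fragments):
    """Same normalization, but counting fragments (read pairs) instead of reads."""
    return rpkm(fragments_on_gene, gene_length_bp, total_mapped_fragments)


# Two genes with the same raw count in the same library:
# the gene that is twice as long gets half the RPKM.
gene_a = rpkm(500, 2_000, 10_000_000)
gene_b = rpkm(500, 4_000, 10_000_000)

# Paired-end library: 1,000 reads on the gene are 500 fragments,
# and 20 million mapped reads are 10 million fragments.
gene_a_fpkm = fpkm(500, 2_000, 10_000_000)
print(gene_a, gene_b, gene_a_fpkm)
```

For a fully paired library, counting fragments halves both the numerator and the denominator relative to counting reads, which is why the FPKM here comes out equal to the RPKM computed from 1,000 reads and 20 million mapped reads.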
Category: Bioinformatics | 21773 views | 0 comments
Papers on RNA-seq analysis methods and tools
xbinbzy 2016-2-22 12:44
Paper: A survey of best practices for RNA-seq data analysis, 2016 ( http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8 )
Paper: A comparison of methods for differential expression analysis of RNA-seq data, 2013 ( https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91 )
Category: Research papers | 3075 views | 0 comments
ABBS: Identification of reference genes for qRT-PCR in human
chshou 2015-12-2 09:19
Identification of reference genes for qRT-PCR in human lung squamous-cell carcinoma by RNA-Seq Cheng Zhan, Yongxing Zhang, Jun Ma, Lin Wang, Wei Jiang, Yu Shi and Qun Wang Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai 200031, China Acta Biochim Biophys Sin 2014, 46: 330–337; doi: 10.1093/abbs/gmt153 Although the accuracy of quantitative real-time polymerase chain reaction (qRT-PCR) is highly dependent on the reliable reference genes, many commonly used reference genes are not stably expressed and as such are not suitable for quantification and normalization of qRT-PCR data. The aim of this study was to identify novel reliable reference genes in lung squamous-cell carcinoma. We used RNA sequencing (RNA-Seq) to survey the whole genome expression in 5 lung normal samples and 44 lung squamous-cell carcinoma samples. We evaluated the expression profiles of 15 commonly used reference genes and identified five additional candidate reference genes. To validate the RNA-Seq dataset, we used qRT-PCR to verify the expression levels of these 20 genes in a separate set of 100 pairs of normal lung tissue and lung squamous-cell carcinoma samples, and then analyzed these results using geNorm and NormFinder. With respect to 14 of the 15 common reference genes (B2M, GAPDH, GUSB, HMBS, HPRT1, IPO8, PGK1, POLR2A, PPIA, RPLP0, TBP, TFRC, UBC, and YWHAZ), the expression levels were either too low to be easily detected, or exhibited a high degree of variability either between lung normal and squamous-cell carcinoma samples, or even among samples of the same tissue type. In contrast, 1 of the 15 common reference genes (ACTB) and the 5 additional candidate reference genes (EEF1A1, FAU, RPS9, RPS11, and RPS14) were stably and constitutively expressed at high levels in all the samples tested. 
ACTB, EEF1A1, FAU, RPS9, RPS11, and RPS14 are ideal reference genes for qRT-PCR analysis of lung squamous-cell carcinoma, while 14 commonly used qRT-PCR reference genes are less appropriate in this context.
Figure: Expression profiling of 15 common reference genes in RNA-Seq data
Full text: http://abbs.oxfordjournals.org/content/46/4/330.full
Related papers:
1 Evaluation and validation of a robust single cell RNA-amplification protocol through transcriptional profiling of enriched lung cancer initiating cells
2 Reference Genes Selection and Normalization of Oxidative Stress Responsive Genes upon Different Temperature Stress Conditions in Hypericum perforatum L.
3 Connectivity Mapping for Candidate Therapeutics Identification Using Next Generation Sequencing RNA-Seq Data
4 Application of the whole-transcriptome shotgun sequencing approach to the study of Philadelphia-positive acute lymphoblastic leukemia
5 Identification of Pathogen Signatures in Prostate Cancer Using RNA-seq
Category: Journal news | 1822 views | 0 comments
[Repost] New tools for RNA-Seq analysis
mashengwei 2015-5-3 09:00
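Analyzing the transcriptome with RNA-Seq is now a very common approach, and I analyzed bacterial transcriptome data during my PhD. The basic workflow for differential expression analysis was: quality control, mapping reads to the genome with bwa, counting the uniquely mapped reads on each gene, and differential expression analysis with DEGSeq (I will write this up when I have time). Because prokaryotes have no alternative splicing, either bwa or bowtie will do. I have recently been processing human transcriptome data, still following the earlier PLoB article "A detailed walkthrough of differential expression analysis with TopHat and Cufflinks". (More on transcriptome analysis tools and methods, such as saturation assessment and TopHat usage, can be found by searching PLoB.)

Within the past month, three papers describing new methods for RNA-Seq data analysis appeared in Nature group journals, one in Nature Methods and two in Nature Biotechnology. Interestingly, all three share one author: Steven Salzberg of the Center for Computational Biology at Johns Hopkins University. Salzberg is a distinguished scientist in bioinformatics and computational biology with extensive experience in genome assembly, and he took part in the Human Genome Project. Since the arrival of next-generation sequencing, he and his team have developed a series of applications, among which Bowtie and TopHat have been widely downloaded and cited.

The three papers introduce three new tools: HISAT, StringTie, and Ballgown. Each replaces an earlier tool developed by Salzberg, and together they offer a completely new route from raw RNA-Seq reads to differential expression analysis.

HISAT stands for Hierarchical Indexing for Spliced Alignment of Transcripts and was developed at Johns Hopkins University. It replaces Bowtie/TopHat and aligns RNA-Seq reads to the genome rapidly. The work was published in Nature Methods on March 9. HISAT uses a large number of FM indexes to cover the whole genome. For the human genome it needs 48,000 indexes, each representing a genomic region of ~64,000 bp. These small indexes, combined with several alignment strategies, give efficient alignment of RNA-Seq reads, especially reads spanning multiple exons. Despite the large number of indexes, HISAT needs only 4.3 GB of memory. The program supports genomes of any size, including those larger than 4 billion bases. HISAT is available at http://ccb.jhu.edu/software/hisat/index.shtml.

StringTie was developed by Johns Hopkins University together with the University of Texas Southwestern Medical Center. It assembles transcripts and estimates expression levels, applying a network flow algorithm with optional de novo assembly to assemble complex datasets into transcripts. Compared with programs such as Cufflinks, StringTie produced more complete and more accurate gene reconstructions and better estimates of expression levels on both simulated and real datasets. For example, from 90 million reads obtained from human blood, StringTie correctly assembled 10,990 transcripts, whereas the runner-up, Cufflinks, assembled only 7,187, an improvement of 53%. On a simulated dataset, StringTie correctly assembled 7,559 transcripts, 20% more than Cufflinks's 6,310. It also runs faster than other assemblers. StringTie is available at http://ccb.jhu.edu/software/stringtie/.

Ballgown, published in Nature Biotechnology in early March, is a tool for differential expression analysis. From RNA-Seq experiment data it estimates differential expression of genes, transcripts, or exons. Detailed documentation for Ballgown: https://github.com/alyssafrazee/ballgown
3318 views | 0 comments
[Repost] How to learn transcript structure through RNA-Seq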
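The figures quoted for HISAT and StringTie can be sanity-checked with a few lines of arithmetic. This is only a check on the numbers stated in the post; the variable and function names are chosen for this example.

```python
# HISAT: 48,000 local indexes of ~64,000 bp each
hisat_indexes = 48_000
bases_per_index = 64_000
genome_covered = hisat_indexes * bases_per_index  # about 3.07 billion bases


def percent_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new / old - 1) * 100


# StringTie vs Cufflinks, correctly assembled transcripts
blood_gain = percent_gain(10_990, 7_187)      # real human blood data
simulated_gain = percent_gain(7_559, 6_310)   # simulated data

print(genome_covered, round(blood_gain), round(simulated_gain))
```

The product of index count and index size lands at roughly 3.1 billion bases, in line with the size of the human genome, and the two ratios reproduce the 53% and 20% improvements quoted above.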
alinatingting 2014-12-26 15:22
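There is more than one way to sequence a transcriptome. Some researchers aim to count transcripts and assess expression levels, in which case sequencing can replace DNA microarrays. Others are interested in the structure of transcripts. Eukaryotic genes are often alternatively spliced, and whether a particular exon is included can have far-reaching biological consequences. The first application is simpler and more widespread. It suits the characteristics of Illumina sequencing platforms, which deliver short RNA sequences, but billions of them per run. For researchers in the second camp, bioinformatics tools and long-read counts are the crux.

Long reads and short reads

According to Jonas Korlach, chief scientific officer of Pacific Biosciences, mammalian transcripts are roughly 1,000 to 3,000 bases long and exist in multiple forms. A gene with 5 exons, for example, may appear in various configurations such as 12345, 1245, 1345, 245, and so on. Working out the structure and abundance of these forms should not be hard: just sequence every RNA molecule and count. The problem is that current sequencing technology cannot do this.

Illumina's HiSeq v4 reagents produce about 4 billion highly accurate reads per run, which is plenty for transcriptome sequencing. However, each paired-end read is 2 x 125 bp long, which makes it hard to determine which fragments belong together. If the reads contain repetitive elements, they are hard to place on the genome.

Michael Snyder, professor of genetics at Stanford University, said in an interview: "If you think about it, the way we study the transcriptome is crazy. We take RNA, blow it into pieces, then try to put the pieces back together to learn what the transcriptome looked like in the first place. It's a terrible way to do it."

Pacific Biosciences' single-molecule sequencing system, the PacBio RS II, produces reads averaging 8,500 bp, enough to cover most transcripts. But each SMRT Cell on the RS II yields only 50,000 to 80,000 reads, too few to read every transcript comprehensively. Other long-read technologies on the market include Illumina's Moleculo technology and Oxford Nanopore Technologies' nanopore technology.

Hybrid approaches

For many researchers, the best-of-both-worlds solution is to combine the two approaches. In a recent study published in PNAS, Snyder's team used a hybrid strategy, sequencing the lymphoblastoid transcriptomes of a child and both parents with PacBio long reads and Illumina short reads. The Illumina reads also served to check errors in the PacBio base calls.

Jason Underwood, director of technology development at the University of Washington's Northwest Genomics Center, used the same strategy to analyze the transcriptome of the H1 human embryonic stem cell line. Their "hybrid sequencing" method identified hundreds of novel genes and long non-coding RNAs (lncRNAs) expressed in H1 cells, along with thousands of isoforms of known genes.

Underwood does not always use short reads for error correction, however. When analyzing chicken transcriptome structure, he used long reads only. Using SMRT sequencing to generate full-length cDNA from embryonic chicken heart, he identified more than 9,000 novel transcript isoforms and more than 500 genes missing from the Ensembl annotation.

According to Korlach, PacBio's technology lets researchers capture the full diversity of transcripts. In the method called Iso-Seq, users synthesize cDNA and size-fractionate it to create libraries of different lengths, which are then circularized and sequenced. PacBio's SMRT analysis software clusters transcripts of identical structure, minimizing sequencing errors. A complementary strategy is circular consensus sequencing (CCS), in which cDNA is circularized and sequenced repeatedly to yield a more accurate averaged read.

Given PacBio's relatively low read count, some researchers combine the technology with methods that select a subset of genes. In a recent study, a team led by Peter Scheiffele at the University of Basel, Switzerland, used the PacBio method to sequence 370,000 neurexin transcripts from adult mouse brain, identifying nearly 1,400 unique isoforms in this family.

Analysis tools

To make sense of those data, Scheiffele's team used an algorithm called GMAP, which Underwood also uses. Other bioinformatics tools for analyzing transcript structure include Cufflinks, SpliceMap, and SigFuge. SigFuge, developed in the laboratory of associate professor D. Neil Hayes at the University of North Carolina at Chapel Hill, is a tool for identifying interesting structural variation. Hayes uses it to identify cancer markers in thousands of patient samples. "If a variant is important, it should be recurrent," he explains. With SigFuge, "we can detect recurrent structural variation in RNA structure."

But how much sequence do you need to find them? Hayes thinks there is no simple answer. "In general, more is better. But the more you sequence, the more expensive the study." He considers 60 million Illumina reads per tumor transcriptome necessary.

As a general guideline, Underwood suggests that users interested in whole-transcriptome analysis analyze at least 1 million reads per sample. "The lowest and highest expressed RNAs may differ by 5 to 6 orders of magnitude," he says. So 1 million reads should suffice even for the rarest transcripts. That requires about 20 SMRT cells on a PacBio instrument, or 2.5 runs at 8 cells per run. (Jeffrey M. Perkel)

References
Tilgner, H, et al., "Defining a personal, allele-specific, and single-molecule long-read transcriptome," Proc Natl Acad Sci USA, 111:9869-74, 2014.
Au, KF, et al., "Characterization of the human ESC transcriptome by hybrid sequencing," Proc Natl Acad Sci USA, 110:E4821–30, published online November 26, 2013, doi: 10.1073/pnas.1320101110.
Thomas, S, et al., "Long-read sequencing of chicken transcripts and identification of new transcript isoforms," PLoS ONE, 9:e94650, 2014.
Schreiner, D, et al., "Targeted combinatorial alternative splicing generates brain region-specific repertoires of neurexins," Neuron, in press, 2014.

Reposted from 测序中国.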
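Two bits of arithmetic in this article lend themselves to a quick sketch: how many exon configurations a 5-exon gene could in principle produce, and how many SMRT cells the 1-million-read guideline implies. The enumeration below simply lists every ordered subset of exons, which is an upper bound rather than a claim about which isoforms actually occur; the function names are chosen for this example.

```python
import math
from itertools import combinations


def possible_isoforms(n_exons):
    """All non-empty exon subsets in genomic order, written like '1245'."""
    exons = range(1, n_exons + 1)
    return [
        "".join(str(e) for e in combo)
        for size in range(1, n_exons + 1)
        for combo in combinations(exons, size)
    ]


def smrt_cells_needed(target_reads, reads_per_cell):
    """Whole SMRT cells required to reach a target read count."""
    return math.ceil(target_reads / reads_per_cell)


isoforms = possible_isoforms(5)
cells_low_yield = smrt_cells_needed(1_000_000, 50_000)   # 50,000 reads per cell
cells_high_yield = smrt_cells_needed(1_000_000, 80_000)  # 80,000 reads per cell
runs = cells_low_yield / 8                                # 8 cells per run

print(len(isoforms), cells_low_yield, cells_high_yield, runs)
```

At the lower yield of 50,000 reads per cell this reproduces the 20 cells and 2.5 runs quoted in the article, and the enumeration includes the four configurations Korlach mentions.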
Category: Transcriptome sequencing | 2421 views | 0 comments
How RNA extraction and library preparation affect mRNA-Seq
alinatingting 2014-8-14 14:21
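RNA-Seq is currently the usual method for finding genes that are differentially expressed across growth stages, stress conditions, and tissues or cell types. It can also identify novel transcripts and alternative splicing events, so it is very widely used. Different RNA extraction methods and library construction workflows affect the resulting sequencing data in different ways.

1. Total RNA extraction

The most familiar RNA extraction method is Trizol-based; kits are also used. Extraction often introduces inhibitors that interfere with downstream enzymatic reactions such as PCR. If these are not properly removed, they affect reverse transcription, end repair, A-tailing, adapter ligation, and PCR amplification, for example by blocking polymerization, reducing polymerase activity, or even degrading the polymerase, and thus affect the final sequencing data.

Common inhibitors are either intrinsic to the sample or introduced during handling. Intrinsic inhibitors include hemoglobin in blood and humic and fulvic acids in plant samples. Inhibitors introduced during handling include EDTA, heparin, chloroform, and phenol. Different samples may carry different inhibitors or other contaminants; see "DNA/RNA Isolation Considerations When Using TruSeq Library Preparation".

If such contaminants are present, purify further with a kit, for example a column-based one, and proceed only once the total RNA meets the desired standard.

Criteria for checking total RNA:
Buffer for dissolved total RNA: pH 7.5-8.0.
Quantify with Qubit or Pico/RiboGreen/Agilent 2100.

Substance                      Absorbance (nm)   260/280 ratio   260/230 ratio
Pure DNA                       280 nm            ~1.8            2.0–2.2
Pure RNA                       280 nm            ~2.0            2.0–2.2
EDTA, carbohydrates, phenol    230 nm            1.5             2.0
Guanidine HCl                  230 nm            1.5             2.0

2. rRNA removal

Total RNA contains a large amount of rRNA, roughly 80%-90% of the sequences, so various methods are used to remove it. In eukaryotes the standard approach is to enrich poly(A)-tailed mRNA with oligo(dT). This fails for transcripts without a poly(A) tail and for partially degraded total RNA, so it is unsuitable for FF (formalin-fixed) and FFPE (formalin-fixed, paraffin-embedded) samples, where it would seriously compromise the completeness of the transcript information obtained.

For FF and FFPE samples and for prokaryotic total RNA, rRNA is removed with kits such as Ribo-Zero or RiboMinus, which work by hybridization capture of rRNA sequences. For FFPE samples, libraries can also be built with duplex-specific nuclease to lower the fraction of rRNA in the sequencing data.

Common rRNA removal methods:
a. rRNA subtractive hybridization: MICROBExpress bacterial mRNA enrichment kit (Ambion), RiboMinus bacteria transcriptome isolation kit (Invitrogen), and Ribo-Zero rRNA removal kit (Epicentre);
b. 5′-monophosphate-dependent exonuclease treatment: mainly mRNA-ONLY prokaryotic mRNA isolation kit (Epicentre);
c. Selective primer amplification: mainly Ovation prokaryotic RNA-seq system (NuGEN);
d. cDNA normalization with duplex-specific nuclease: mainly trimmer-direct cDNA normalization kit (Evrogen);
e. Tailing with E. coli poly(A) polymerase: MessageAmp II-bacteria kit (Ambion).
Co-immunoprecipitation with RNA-binding proteins such as Hfq is another option. Hfq binds small RNAs efficiently and helps them pair with their target mRNAs, so it is commonly used to study small RNAs and their mRNA targets.

3. Library construction

There are generally two ways to build a library from the mRNA left after rRNA removal:
a. reverse-transcribe the mRNA with oligo(dT) first, then fragment the cDNA;
b. fragment the mRNA first, then reverse-transcribe with random primers.

The two methods give quite different results (in the original figure, a is the blue line and b the red line). Fragmenting the mRNA before reverse transcription yields reads spread over the gene body. Reverse-transcribing first, especially with oligo(dT), yields reads strongly biased toward the 3' end of transcripts. For mRNA-Seq, the recommended library construction is therefore to fragment the mRNA first and reverse-transcribe afterwards.

mRNA libraries come in several types: standard mRNA libraries, normalized libraries (using duplex-specific nuclease), full-length cDNA libraries, and strand-specific libraries (dUTP replaces dTTP in second-strand synthesis). Choose according to the research goal: normalized libraries recover low-abundance genes, while strand-specific libraries recover transcripts from both strands as well as alternative splicing information.

Macrogen 千年基因 summarizes its detailed requirements for RNA libraries sequenced on NGS platforms below:
*The table covers total RNA, mRNA, and viral ssRNA, for reference.

References (questions are welcome at ttwu@macrogencn.com, thank you!):
1. Influence of RNA extraction methods and library selection schemes on RNA-seq data;
2. IVT-seq reveals extreme bias in RNA-sequencing;
3. Ribosomal RNA depletion for massively parallel bacterial RNA-sequencing applications;
4. Comprehensive comparative analysis of RNA sequencing methods for degraded or low input samples;
5. Illumina support;
6. Macrogen 千年基因 support;
7. Prokaryotic transcriptomics: a new view on regulation, physiology and pathogenicity;
8. Efficient and robust RNA-seq process for cultured bacteria and complex community transcriptomes;
9. A perspective: metatranscriptomics as a tool for the discovery of novel biocatalysts;
10. Deep sequencing analysis of small noncoding RNA and mRNA targets of the global post-transcriptional regulator;
11. Global analysis of small RNA and mRNA targets of Hfq;
12. Validation of two ribosomal RNA removal methods for microbial metatranscriptomics;
13. RNA-Seq: a revolutionary tool for transcriptomics.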
Category: Transcriptome sequencing | 19521 views | 1 comment
How to calculate FPKM values for genes of interest
ginseachen 2014-7-3 10:21
FPKM (Fragments Per Kilobase of exon model per Million mapped reads) indicates the expression (abundance) of genes. Below I describe how to obtain the FPKM values of genes of interest.

1. Software download
1) fastq-dump: converts an sra file to fastq files. Website: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software
2) bowtie: an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. Website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml
3) cufflinks: assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. Website: http://cufflinks.cbcb.umd.edu/
4) gffread: converts a gff3 file to a gtf file. Website: http://cufflinks.cbcb.umd.edu/ (this program is included in the cufflinks package)

2. Operation
1) Download the genome.fa and genes.gff3 files from the genome website; download the sra file from NCBI.
2) Format conversion
    $ fastq-dump -I --split-files SRR123456789.sra   # convert sra file to fastq files
    $ gffread -E genes.gff3 -o genes.gtf             # convert gff3 file to gtf file
3) Index files
    $ bowtie2-build genome.fa genome
4) Alignment
    $ bowtie2 -x genome -1 SRR123456789_1.fastq -2 SRR123456789_2.fastq -S SRR123456789.sam
    $ samtools view -bS SRR123456789.sam > SRR123456789.bam
    $ samtools sort SRR123456789.bam SRR123456789
5) FPKM values
    $ cufflinks SRR123456789.bam -G genes.gtf -o result

After these steps, the FPKM values can be extracted from the genes.fpkm_tracking file by gene ID.

Note: if you find any bugs, please contact me.
5552 views | 0 comments
[Repost] Auburn University's channel catfish BSR-Seq paper
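The last step, pulling FPKM values out of genes.fpkm_tracking by gene ID, could look roughly like the sketch below. It assumes the file is tab-delimited with a header row that includes gene_id and FPKM columns, which is how Cufflinks tracking files are normally laid out; the toy rows, gene IDs, and function name are invented for the example, so check the header of your own file before relying on it.

```python
import csv
import io

# Toy stand-in for result/genes.fpkm_tracking (tab-delimited, with header).
TOY_TRACKING = (
    "tracking_id\tclass_code\tnearest_ref_id\tgene_id\tgene_short_name\t"
    "tss_id\tlocus\tlength\tcoverage\tFPKM\tFPKM_conf_lo\tFPKM_conf_hi\tFPKM_status\n"
    "GENE1\t-\t-\tGENE1\tabc1\t-\tchr1:100-900\t-\t-\t12.5\t10.0\t15.0\tOK\n"
    "GENE2\t-\t-\tGENE2\tabc2\t-\tchr1:2000-3500\t-\t-\t0\t0\t0\tOK\n"
    "GENE3\t-\t-\tGENE3\tabc3\t-\tchr2:50-700\t-\t-\t230.75\t200.0\t260.0\tOK\n"
)


def fpkm_for_genes(handle, wanted_ids):
    """Return {gene_id: FPKM} for the requested gene IDs."""
    wanted = set(wanted_ids)
    reader = csv.DictReader(handle, delimiter="\t")
    return {
        row["gene_id"]: float(row["FPKM"])
        for row in reader
        if row["gene_id"] in wanted
    }


selected = fpkm_for_genes(io.StringIO(TOY_TRACKING), ["GENE1", "GENE3", "GENE9"])
print(selected)
```

With a real run, replace the StringIO object with `open("result/genes.fpkm_tracking")`; IDs that are absent from the file (GENE9 here) are simply missing from the returned dictionary.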
jackiehu 2014-3-17 10:14
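BMC Genomics. 2013 Dec 30;14(1):929. Bulk segregant RNA-seq reveals expression and positional candidate genes and allele-specific expression for disease resistance against enteric septicemia of catfish.

Keywords: BSR-Seq, bulked segregant RNA-seq. The method combines bulked segregant analysis (BSA) with NGS-based transcriptome sequencing (RNA-seq).

Species and disease: ESC, enteric septicemia of catfish. The channel catfish is a large freshwater fish native to North America.

Experimental material: 4 BC1 families (F1 from a resistant × susceptible cross, backcrossed to the susceptible parent), each fish about 35 g. Control population: 100 fish per family, 400 in total. Treated population: 300 fish per family, 1200 in total, challenged with 1000 ml of Edwardsiella bacteria (4×10^8 CFU/ml).

Sampling: after 3-5 days, dead individuals were collected from 2 families as susceptible fish. After 2 weeks, all surviving individuals from the same 2 families were collected as resistant fish. Control fish were collected at the same time.

Sequencing design: resistant, susceptible, and control groups; 12 fish per family per group, 24 fish per group, 72 fish across the 3 groups. An equal weight of liver tissue was taken from each fish for RNA extraction. RNA from the individuals of each group was pooled in equal amounts, followed by TruSeq library construction and PE100 transcriptome sequencing. De novo assembly used Trinity.

Sequencing output: resistant group, 151 M reads; susceptible group, 116 M reads; control group, 132 M reads. In total 374 M clean reads were obtained and assembled de novo into 232,338 non-redundant contigs with a mean length of 825 bp, of which more than 51,000 were longer than 1 kb.

Differential expression: comparing the resistant group with the control, 224 genes were differentially expressed, 130 of them up-regulated, and 42 genes changed more than 10-fold. Comparing the susceptible group with the control, 1240 genes were differentially expressed, 233 of them more than 10-fold. Comparing the resistant and susceptible groups, 1255 genes were differentially expressed. Of the 528 genes up-regulated in the resistant group, 4 changed more than 100-fold, 19 between 50- and 100-fold, and 86 between 10- and 50-fold. Of the genes down-regulated in the resistant group, 2 changed more than 100-fold, 10 between 50- and 100-fold, and 86 between 10- and 50-fold.

SNPs: Popoolation identified 513,371 SNPs and VarScan identified 482,035; 465,537 SNPs were found by both programs and lie in 31,646 contigs. A total of 56,419 significant SNPs in 11,249 contigs showed significant allele frequency differences between the resistant and susceptible groups. Of these 11,249 contigs, 5480 matched known genes, representing 4304 unique genes.

Bulk frequency ratios (BFR) were derived from the RNA-seq data of the resistant and susceptible groups. Many genes carrying marker SNPs had BFR values above 2; 359 genes had BFR values of 4 or more. Of these, 337 had BFR values between 4 and 16, 23 above 16, and 4 above 32. The 359 genes with BFR ≥ 4 had combined allele ratios below 7, most of them 1-3, which suggests that their high BFR values result from genetic segregation rather than from allele-specific expression.

A total of 354 high-BFR genes were found to contain significant SNPs. BLASTN searches against the draft catfish genome (unpublished) placed these genes on 201 scaffolds, of which 134 have been assigned to linkage groups. Eight chromosomes were identified as carrying QTLs for disease resistance in channel catfish, based on thresholds of 5 genes or a BFR of 10. LG6, LG15, and LG17 carried the most genes, and 6 chromosomes carried genes at the BFR threshold of 10.

In general, blue catfish are more resistant to ESC than channel catfish. Among the 98 genes in this study whose SNPs had combined allele ratios above 14, the parental origin was known for 18: 11 originated from channel catfish and 7 from blue catfish. Of the 11 genes preferentially expressing the channel catfish allele, 6 were highly expressed in resistant fish and 5 in susceptible fish. Of the 7 genes preferentially expressing the blue catfish allele, 4 were highly expressed in resistant fish and 3 in susceptible fish.

From these analyses the paper identifies 17 differentially expressed genes that combine a high BFR with a low allele ratio; these 17 genes are candidate key genes for disease resistance.

Reposted from: http://www.bgitechsolutions.cn/bbs/forum.php?mod=viewthreadtid=545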
Category: Genetics | 3822 views | 0 comments
[Repost] List of RNA-Seq bioinformatics tools
chuangma2006 2013-8-16 09:01
List of RNA-Seq bioinformatics tools From Wikipedia, the free encyclopedia RNA-Seq ( RNA-Seq.ppt / RNA-Seq Guide ) is a revolutionary technique to perform transcriptome studies based on next-generation sequencing technologies. This technique is largely dependent on bioinformatics tools developed to support the different steps of the process. Here are listed some of the principal tools commonly employed and links to some related web resources. To follow an integrated guide to the analysis of RNA-seq data, please see - Next Generation Sequencing (NGS)/RNA , Hands-On Tutorial or RNA-Seq Workflow . Also, important links are SEQanswers , RNA-SeqList , RNA-SeqBlog , Biostar and bioscholar . Contents 1 Quality control and pre-processing data 1.1 Quality control and filtering data 1.2 Pre-processing data 2 Alignment Tools 2.1 Short (Unspliced) aligners 2.2 Spliced aligners 2.2.1 Aligners based on known splice junctions (annotation-guided aligners) 2.2.2 De novo Splice Aligners 2.2.2.1 De novo Splice Aligners that also use annotation optionally 2.2.2.2 Other Spliced Aligners 3 Quantitative analysis and Differential Expression 3.1 Multi-tool solutions 4 Workbench (analysis pipeline / integrated solutions) 4.1 Commercial Solutions 4.2 Open Source Solutions 5 Fusion genes/chimeras/translocation finders/structural variations 6 Copy Number Variations identification 7 RNA-Seq simulators 8 Transcriptome assemblers 8.1 Genome-Guided assemblers 8.2 Genome-Independent assemblers 9 Visualization tools 10 Functional, Network Pathway Analysis Tools 11 Further annotation tools for RNA-Seq data 12 RNA-Seq Databases 13 Webinars and Presentations 14 References Quality control and pre-processing data Quality control and filtering data Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq. Often, is necessary to filter data, removing low quality sequences or bases (trimming), linkers, overrepresented sequences or noise to assure a coherent final result. 
clean_reads clean_reads . condetri condetri . cutadapt cutadapt removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case. FastQC FastQC is a quality control tool for high-throughput sequence data ( Babraham Institute ) and is developed in Java . Import of data is possible from FastQ files, BAM or SAM format. This tool provides an overview to inform about problematic areas, summary graphs and tables to rapid assessment of data. Results are presented in HTML permanent reports. FastQC can be run as a stand alone application or it can be integrated into a larger pipeline solution. See also seqanswers/FastQC . FASTX FASTX Toolkit is a set of command line tools to manipulate reads in files FASTA or FASTQ format. These commands make possible preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, information about statistics of quality, removing sequencing adapters, filtering and cutting sequences based on quality or conversion DNA / RNA . Flexbar Flexbar performs removal of adapter sequences, trimming and filtering features. FreClu FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology. HTSeq HTSeq . htSeqTools htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes possible visualize sample correlations, to remove over-amplification artifacts, to assess enrichment efficiency, to correct strand bias and visualize hits. PRINSEQ PRINSEQ generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios. qrqc qrqc is a Bioconductor package to quick read quality control. 
RNA-SeQC RNA-SeQC is a tool with application in experiment design, process optimization and quality control before computational analysis. Essentially, provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (like mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and is not required installation, however can be run using the GenePattern web interface. The input could be one or more BAM files. HTML reports are generated as output. RSeQC RSeQC analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA, BED files or Chromosome size file (two-column, plain text file). Visualization can be performed by genome browsers like UCSC, IGB and IGV. However, R scripts can also be used to visualization. Sabre sabre . SAMStat SAMStat identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping. Scythe scythe . SEECER seecer SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly. Sickle Sickle . 
ShortRead: a package provided in the R / Bioconductor environment that allows input, manipulation, quality assessment and output of next-generation sequencing data. It supports data manipulation such as filters that remove reads based on predefined criteria. ShortRead can be complemented with several Bioconductor packages for further analysis and visualization (Biostrings, BSgenome, IRanges, and so on). See also seqanswers/ShortRead.
SysCall: a classifier tool for the identification and correction of systematic errors in high-throughput sequence data.
Trimmomatic: performs trimming for Illumina platforms and works with FASTQ reads (single or paired-end). Tasks include cutting adapters, cutting bases at optional positions based on quality thresholds, cropping reads to a specific length, and converting quality scores to Phred-33/64.

Pre-processing data
Further tasks performed before alignment.
DeconRNASeq: an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data.
FastQ Screen: screens FASTQ format sequences against a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc.).
FLASH: a read pre-processing tool. FLASH combines paired-end reads that overlap and converts them to single long reads.
IDCheck

Alignment Tools
After quality control, the first step of RNA-Seq analysis is alignment of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software and HTS Mappers.

Short (Unspliced) aligners
Short aligners align continuous reads (reads that contain no gaps resulting from splicing) to a reference genome.
Basically, there are two types: 1) those based on the Burrows-Wheeler transform, such as Bowtie and BWA, and 2) those based on seed-and-extend methods using the Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster; however, some tools of the second group, despite the extra time spent, tend to be more sensitive and align more reads correctly. See a comparative study of short aligners - comparative study.
BFAST: aligns short reads to reference sequences and is particularly sensitive towards errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanswers/BFAST.
Bowtie: a fast short aligner using an algorithm based on the Burrows-Wheeler transform and the FM-index. Bowtie tolerates a small number of mismatches. See also seqanswers/Bowtie.
Burrows-Wheeler Aligner (BWA): implements two algorithms, both based mainly on the Burrows-Wheeler transform. The first algorithm is used for reads with a low error rate (3%). The second algorithm was designed to handle more errors and implements a combined strategy: Burrows-Wheeler transform and Smith-Waterman method. BWA allows mismatches and small gaps (insertions and deletions). The output is presented in SAM format. See also seqanswers/BWA.
Short Oligonucleotide Analysis Package (SOAP)
GNUMAP: performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool can handle alignment in repetitive regions of a genome without losing information. The output of the program was designed for easy visualization with available software.
Maq: first aligns reads to reference sequences and then performs a consensus stage. The first stage performs only ungapped alignment and tolerates up to 3 mismatches. See also seqanswers/Maq.
Mosaik: aligns reads containing short gaps using the Smith-Waterman algorithm, ideal for handling SNPs, insertions and deletions.
See also seqanswers/Mosaik.
NovoAlign (commercial): a short aligner for the Illumina platform based on the Needleman-Wunsch algorithm. NovoAlign tolerates up to 8 mismatches per read and up to 7 bp of indels. It can handle bisulphite data. Output is in SAM format. See also seqanswers/NovoAlign.
RazerS. See also seqanswers/RazerS.
SEAL: uses a MapReduce model for distributed computing on clusters of computers. SEAL uses BWA to perform alignment and Picard MarkDuplicates to detect and remove duplicate reads. See also seqanswers/SEAL.
SeqMap. See also seqanswers/SeqMap.
SHRiMP: employs two techniques to align short reads. First, a q-gram filtering technique based on multiple seeds identifies candidate regions. Second, these regions are investigated in detail using the Smith-Waterman algorithm. See also seqanswers/SHRiMP.
SMALT
Stampy: combines the sensitivity of hash tables and the speed of BWA. Stampy is designed to align reads containing sequence variation such as insertions and deletions. It can handle reads of up to 4500 bases and presents the output in SAM format. See also seqanswers/Stampy.
ZOOM (commercial): a short aligner for the Illumina/Solexa 1G platform. ZOOM uses an extended spaced seeds methodology, building hash tables for the reads, and tolerates mismatches, insertions and deletions. See also seqanswers/ZOOM.

Spliced aligners
Many reads span exon-exon junctions and cannot be aligned directly by short aligners, so dedicated spliced aligners were needed. Some spliced aligners first use a short aligner to align the unspliced (continuous) reads (exon-first approach), and then follow a different strategy for the remaining reads that contain spliced regions: normally these reads are split into smaller segments that are mapped independently. See also Methods to study splicing from high-throughput RNA Sequencing data and Methods to Study RNA-Seq (workflow).
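The segment-splitting idea described above can be sketched in a few lines of Python. This is a toy illustration only, not how any listed aligner is implemented: it uses exact string matching, and the function names (`split_read`, `map_segments`, `implied_gap`) and the sequences are invented for the example.

```python
def split_read(read, seg_len):
    """Cut a read into consecutive, non-overlapping segments of seg_len bases."""
    return [(i, read[i:i + seg_len])
            for i in range(0, len(read) - seg_len + 1, seg_len)]


def map_segments(read, genome, seg_len):
    """Map each segment independently by exact match (toy stand-in for a short aligner).

    Returns (offset_in_read, genome_position or None) for every segment.
    """
    hits = []
    for offset, segment in split_read(read, seg_len):
        pos = genome.find(segment)
        hits.append((offset, pos if pos >= 0 else None))
    return hits


def implied_gap(hits):
    """Extra genomic distance between the first and last mapped segments.

    A positive value suggests that an intron separates the two segments.
    """
    mapped = [(o, p) for o, p in hits if p is not None]
    if len(mapped) < 2:
        return None
    (o1, p1), (o2, p2) = mapped[0], mapped[-1]
    return (p2 - p1) - (o2 - o1)


# Invented toy genome: two exons separated by a 20-base intron.
EXON1 = "ACGTTGCAAGGCTTAC"
INTRON = "GT" + "A" * 16 + "AG"
EXON2 = "TTGACCGATCGGATCA"
GENOME = EXON1 + INTRON + EXON2

# A read that spans the exon-exon junction.
READ = EXON1[-12:] + EXON2[:12]

HITS = map_segments(READ, GENOME, 8)
GAP = implied_gap(HITS)
```

The middle segment straddles the junction and fails to map as a contiguous string, while the flanking segments map to positions that are further apart on the genome than in the read; real spliced aligners refine such evidence with mismatch-tolerant alignment and splice-site signals.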

Aligners based on known splice junctions (annotation-guided aligners)
In this case the detection of splice junctions is based on data about known junctions available in databases. Tools of this type cannot identify new splice junctions. Some of this data comes from other expression methods such as expressed sequence tags (EST).
Erange: a tool for alignment and data quantification for mammalian transcriptomes. See also seqanswers/Erange.
IsoformEx
MapAL
OSA
RNA-MATE: a computational pipeline for alignment of data from the Applied Biosystems SOLiD system. It offers quality control and trimming of reads. The genome alignments are performed using mapreads, and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting. See also seqanswers/RNA-MATE.
RUM: performs alignment with a pipeline that handles reads with splice junctions, using Bowtie and BLAT. The workflow starts with alignment against a genome and a transcriptome database, executed by Bowtie. The next step aligns the unmapped sequences to the reference genome using BLAT. In the final step all alignments are merged to give the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format.
RNASEQR. See also seqanswers/RNASEQR.
SAMMate. See also seqanswers/SAMMate.
SpliceSeq
X-Mate

De novo Splice Aligners
De novo splice aligners detect new splice junctions without needing previously annotated information (some of these tools offer annotation as a supplementary option). See also De novo Splice Aligners.
ABMapper. See also seqanswers/ABMapper.
ContextMap: was developed to overcome some limitations of other mapping approaches, such as resolution of ambiguities.
The central idea of this tool is to consider reads in their gene expression context, thereby improving alignment accuracy. ContextMap can be used as a stand-alone program or on top of mappers that produce a SAM file as output (e.g. TopHat or MapSplice). In stand-alone mode it aligns reads to a genome, to a transcriptome database or both.
CRAC: proposes a novel way of analyzing reads that integrates genomic locations and local coverage, and detects candidate mutations, indels, and splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with, e.g., 200 nt reads and should fit future needs of read analyses.
GSNAP. See also seqanswers/GSNAP.
HMMSplicer: can identify canonical and non-canonical splice junctions in short reads. First, unspliced reads are removed with Bowtie. The remaining reads are then divided in half one at a time, each half is seeded against a genome, and the exon borders are determined with a Hidden Markov Model. A quality score is assigned to each junction, which is useful for detecting false positives. See also seqanswers/HMMSplicer.
MapSplice. See also seqanswers/MapSplice.
OLego. See also seqanswers/OLego.
PALMapper. See also seqanswers/PALMapper.
Pass: aligns gapped and ungapped reads and also bisulfite sequencing data. It can filter data before alignment (removal of adapters). Pass uses the Needleman-Wunsch and Smith-Waterman algorithms and performs alignment in 3 stages: scanning positions of seed sequences in the genome, testing the contiguous regions and finally refining the alignment. See also seqanswers/Pass.
PASSion
PASTA
QPALMA: predicts splice junctions using machine learning algorithms. The training set is a set of spliced reads with quality information and already known alignments. See also seqanswers/QPALMA.
SeqSaw
SoapSplice
SpliceMap. See also seqanswers/SpliceMap.
SplitSeek. See also seqanswers/SplitSeek.
SuperSplat: was developed to find all types of splice junctions. The algorithm iteratively splits each read into all possible two-chunk combinations and tries to align each chunk. Output is in "Supersplat" format. See also seqanswers/SuperSplat.
Subread: a superfast, accurate and scalable read aligner. It uses the seed-and-vote mapping paradigm to determine the mapping location of a read from its largest mappable region. It automatically decides whether the read should be globally or locally mapped. For RNA-seq data, Subread should be used for the purpose of expression analysis. Subread is also very powerful in mapping gDNA-seq reads. See also seqanswers/Subread.
Subjunc: a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions, and uses the donor/acceptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read, including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for junction detection and genomic variation detection in RNA-seq data. See also seqanswers/Subjunc.
TrueSight

De novo Splice Aligners that also use annotation optionally
GEM
MapNext. See also seqanswers/MapNext.
STAR: an ultrafast tool that employs "sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure", and detects canonical and non-canonical splice junctions and chimeric (fusion) sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. See also seqanswers/STAR.
TopHat: is designed to find de novo junctions. TopHat aligns reads in two steps.
First, unspliced reads are aligned with Bowtie, and the aligned reads are then assembled with Maq, resulting in islands of sequence. Second, the splice junctions are determined from the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences. See also seqanswers/TopHat.

Other Spliced Aligners
G.Mo.R-Se: a method that uses RNA-Seq reads to build de novo gene models.

Quantitative analysis and Differential Expression
These tools calculate the abundance of each gene expressed in an RNA-Seq sample (see also Quantification models). Some software is also designed to study the variability of gene expression between samples (differential expression). Quantitative and differential studies depend largely on the quality of read alignment and the accuracy of isoform reconstruction. See a comparative study of differential expression methods and Which method should you use for normalization of rna-seq data?.
Alexa-Seq: a pipeline for gene expression analysis, transcript-specific expression analysis, exon junction expression and quantitative alternative expression analysis. It offers extensive alternative expression visualization, statistics and graphs. See also seqanswers/Alexa-Seq.
ASC. See also seqanswers/ASC.
BaySeq: a Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods. There is an option to use the "snow" package for parallel data processing, recommended when dealing with large data sets. See also seqanswers/BaySeq.
BBSeq. See also seqanswers/BBSeq.
BitSeq
CEDER
CPTRA
casper: a Bioconductor package to quantify expression at the isoform level.
It combines informative data summaries, flexible estimation of experimental biases and statistical precision considerations, which (reportedly) provide substantial reductions in estimation error.
Cufflinks: is appropriate for measuring global de novo transcript isoform expression. It assembles transcripts, estimates abundances and determines differential expression and regulation in RNA-Seq samples. See also seqanswers/Cufflinks.
DESeq: a Bioconductor package for differential gene expression analysis based on the negative binomial distribution. See also seqanswers/DESeq.
DEGSeq. See also seqanswers/DEGSeq.
DEXSeq: a Bioconductor package that finds differential exon usage based on RNA-Seq exon counts between samples. DEXSeq employs the negative binomial distribution and provides options for visualization and exploration of the results.
DiffSplice: a method for differential expression detection and visualization that does not depend on gene annotations. It is based on the identification of alternative splicing modules (ASMs) that diverge between isoforms. A non-parametric test is applied to each ASM to identify significant differential transcription with a measured false discovery rate.
EBSeq
EdgeR: an R package for differential expression analysis of data from sequencing-based methods such as RNA-Seq, SAGE or ChIP-Seq. edgeR employs statistical methods based on the negative binomial distribution as a model for count variability. See also seqanswers/EdgeR.
eXpress: offers transcript-level RNA-Seq quantification and allele-specific and haplotype analysis, and can estimate the abundances of the multiple isoforms present in a gene. Although it can be coupled directly with aligners (like Bowtie), eXpress can also be used with de novo assemblers, so a reference genome is not needed for alignment. It runs on Linux, Mac and Windows.
ERANGE: performs alignment, normalization and quantification of expressed genes. See also seqanswers/ERANGE.
featureCounts: an efficient read summarization/quantification program. It is implemented in both the SourceForge Subread package and the Bioconductor Rsubread package.
FDM
GPSeq
MATS
MISO: quantifies the expression level of splice variants from RNA-Seq data and can recognize differentially regulated exons/isoforms across different samples. MISO uses a probabilistic method (Bayesian inference) to calculate the probability of each read's origin. See also seqanswers/MISO.
MMSEQ: a pipeline for estimating isoform expression and allelic imbalance in diploid organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat, ArrayExpressHTS and SAMtools, and edgeR or DESeq for differential expression. See also seqanswers/MMSEQ.
Myrna: a pipeline tool that runs in a cloud environment (Elastic MapReduce) or on a single computer for estimating differential gene expression in RNA-Seq datasets. Bowtie is employed for short read alignment and R algorithms for interval calculations, normalization and statistical processing. See also seqanswers/Myrna.
NEUMA: a tool to estimate RNA abundances using length normalization, based on uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data available in databases like RefSeq.
NOISeq. See also seqanswers/NOISeq.
NSMAP: allows inference of isoforms as well as estimation of expression levels, without annotated information. The exons are aligned and splice junctions are identified using TopHat. All possible isoforms are computed by combining the detected exons.
RNAeXpress: can be run with a Java GUI or from the command line on Mac, Windows and Linux. It can be configured to perform read counting, feature detection or GTF comparison on mapped RNA-Seq data.
rSeq
RSEM. See also seqanswers/RSEM.
rQuant: a web service (Galaxy installation) that determines abundances of transcripts per gene locus, based on quadratic programming. rQuant can evaluate biases introduced by experimental conditions. It employs a combination of tools: PALMapper (read alignment), and mTiM and mGene (inference of new transcripts).
Scotty: performs power analysis to estimate the number of replicates and depth of sequencing required to call differential expression.
SpliceTrap
SplicingCompass

Multi-tool solutions
DEB: a web interface/pipeline for comparing the differentially expressed genes reported by different tools. Three algorithms are currently available: edgeR, DESeq and baySeq.

Workbench (analysis pipeline / integrated solutions)
See also an interesting blog 16 rna-seq tools you have to consider for your analysis pipeline.

Commercial Solutions
Avadis NGS
CLC Genomics Workbench
DNASTAR
GeneSpring GX
geospiza
Golden Helix
NextGENe
Partek

Open Source Solutions
ArrayExpressHTS (and ebi_ArrayExpressHTS): a Bioconductor package for preprocessing, quality assessment and expression estimation of RNA-Seq datasets. It can be run remotely on the European Bioinformatics Institute cloud or locally. The package makes use of several tools: ShortRead (quality control), Bowtie, TopHat or BWA (alignment to a reference genome), SAMtools format, and Cufflinks or MMSEQ (expression estimation). See also seqanswers/ArrayExpressHTS.
BiNGS!SL-seq
Chipster
easyRNASeq
ExpressionPlot
FX
Galaxy: a general purpose workbench platform for computational biology.
There are several publicly accessible Galaxy servers that support RNA-Seq tools and workflows, including NBIC's Andromeda, the CBIIT-Giga server, the Galaxy Project's public server, the GeneNetwork Galaxy server, the University of Oslo's Genomic Hyperbrowser, URGI's server (which supports S-MART), and many others.
GENE-Counter: a Perl pipeline for RNA-Seq differential gene expression analyses. GENE-Counter performs alignments with CASHX, Bowtie, BWA or another aligner that outputs SAM. Differential gene expression is run with three optional packages (NBPSeq, edgeR and DESeq) using negative binomial distribution methods. Results are stored in a MySQL database to allow additional analyses.
GenePattern: offers integrated solutions for RNA-Seq analysis (Broad Institute).
GeneProf: freely accessible, easy-to-use analysis pipelines for RNA-seq and ChIP-seq experiments.
MultiExperiment Viewer (MeV): is suitable for analysis, data mining and visualization of large-scale genomic data. The MeV modules include a variety of algorithms for tasks such as clustering and classification, Student's t-test, Gene Set Enrichment Analysis and Significance Analysis. MeV runs on Java. See also seqanswers/MeV.
NGSUtils
RobiNA: provides a graphical user interface for working with R/Bioconductor packages. RobiNA provides a package that automatically installs all required external tools (R/Bioconductor frameworks and Bowtie). It offers a variety of quality control methods and can produce many tables and plots giving detailed results for differential expression. The results can also be visualized and manipulated with MapMan and PageMan. RobiNA runs on Java version 6.
S-MART: handles mapped RNA-Seq data and performs mainly data manipulation (selection/exclusion of reads, clustering and differential expression analysis) and visualization (read information, distribution, comparison with epigenomic ChIP-Seq data). It can be run on any laptop by a person without a computing background. A friendly graphical user interface makes the tools easy to operate. See also seqanswers/S-MART.
Taverna
wapRNA

Fusion genes/chimeras/translocation finders/structural variations
Genome rearrangements resulting from diseases like cancer can produce aberrant genetic modifications such as fusions or translocations. Identifying these modifications plays an important role in carcinogenesis studies.
BreakDancer. See also seqanswers/BreakDancer.
ChimeraScan
EBARDenovo
FusionAnalyser
FusionCatcher
FusionHunter: identifies fusion transcripts without depending on already known annotations. It uses Bowtie as a first aligner and paired-end reads. See also seqanswers/FusionHunter.
FusionMap
FusionSeq. See also seqanswers/FusionSeq.
SOAPFuse
SOAPfusion
TopHat-Fusion: is based on TopHat and was developed to handle reads resulting from fusion genes. It does not require prior data about known genes and uses Bowtie to align continuous reads. See also seqanswers/TopHat-Fusion.
ViralFusionSeq: a high-throughput sequencing (HTS) tool for discovering viral integration events and reconstructing fusion transcripts at single-base resolution. See also hkbic/VFS and SEQWiki/VFS.
DeFuse

Copy Number Variations identification
CNVseq: detects copy number variations using a statistical model derived from array-comparative genomic hybridization. Sequence alignment is performed by BLAT, calculations are executed by R modules, and the process is fully automated using Perl.
See also seqanswers/CNVseq.

RNA-Seq simulators
These simulators generate in silico reads and are useful for comparing and testing the efficiency of algorithms developed to handle RNA-Seq data. Some of them also make it possible to analyse and model RNA-Seq protocols.
BEERS Simulator: is designed for mouse or human data and paired-end reads sequenced on the Illumina platform. BEERS generates reads starting from a pool of gene models drawn from different published annotation sources. Some genes are chosen randomly, errors are then deliberately introduced (such as indels, base changes and low-quality tails), and novel splice junctions are constructed.
dwgsim
Flux Simulator: implements a computational pipeline that mimics an RNA-Seq experiment. All component steps that influence RNA-Seq are taken into account in the simulation (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing). These steps have experimental attributes that can be measured, and the approximate experimental biases are captured. Flux Simulator allows these steps to be joined as modules to analyse different types of protocols. See also seqanswers/Flux.
RSEM Read Simulator: rsem-simulate-reads.
RNASeqReadSimulator: contains a set of simple, command-line driven Python scripts. It generates random expression levels of transcripts, simulates reads (single or paired-end) with a specific positional bias pattern, and generates random errors from sequencing platforms.
RNA Seq Simulator

Transcriptome assemblers
The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assembling transcriptomes. Genome-guided methods use a reference genome (if possible a finished, high-quality genome) as a template to align reads and assemble them into transcripts.
Genome-independent methods do not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly into transcripts.

Genome-Guided assemblers
Cufflinks
IsoInfer
IsoLasso
RNAeXpress
Scripture. See also seqanswers/Scripture.

Genome-Independent assemblers
KISSPLICE
Oases. See also seqanswers/Oases.
Rnnotator
SOAPdenovo. See also seqanswers/SOAPdenovo.
Scaffolding Translation Mapping (STM)
Trans-ABySS. See also seqanswers/Trans-ABySS.
Trinity. See also seqanswers/Trinity.
Velvet: Velvet (algorithm). Velvet (EMBL-EBI). See also seqanswers/Velvet.

Visualization tools
Artemis
Apollo
EagleView
GBrowse
Integrated Genome Browser (IGB)
Integrative Genomics Viewer (IGV)
GenomeView
MapView
Tablet
Tbrowse: HTML5 Transcriptome Browser
Savant
Samscope
SeqMonk. See also seqanswers/SeqMonk.
Vespa

Functional, Network Pathway Analysis Tools
Ingenuity Systems (commercial) iReport and IPA: Ingenuity's IPA and iReport applications enable you to upload, analyze and visualize RNA-Seq datasets, eliminating the obstacles between data and biological insight. Both IPA and iReport support identification, analysis and interpretation of differentially expressed isoforms between condition and control samples, and support interpretation and assessment of expression changes in the context of biological processes, disease and cellular phenotypes, and molecular interactions. Ingenuity iReport supports the upload of the native Cuffdiff file format as well as gene expression lists. IPA supports the upload of gene expression lists.
Gene Set Association Analysis for RNA-Seq (GSAA-Seq): a computational method that assesses the differential expression of a pathway/gene set between two biological states based on RNA-Seq data.

Further annotation tools for RNA-Seq data
seq2HLA: an annotation tool for obtaining an individual's HLA class I and II type and expression using standard NGS RNA-Seq data in fastq format. It maps RNA-Seq reads against a reference database of HLA alleles using bowtie, then determines and reports HLA type, confidence score and locus-specific expression level. This tool is developed in Python and R. It is available as a console tool or Galaxy module. See also seqanswers/seq2HLA.
HLAminer: a computational method for identifying HLA alleles directly from whole genome, exome and transcriptome shotgun sequence datasets. HLA allele predictions are derived by targeted assembly of shotgun sequence data and comparison to a database of reference allele sequences. This tool is developed in Perl and is available as a console tool.
pasa

RNA-Seq Databases
queryable-rna-seq-database
RNA-Seq Atlas
SRA

Webinars and Presentations
RNASeq-Blog Presentations
RNA-Seq Workshop Documentation (UC Davis University)
VIDEO: Strategies for Identifying Biologically Compelling Genes from Breast Cancer Subtype RNA-Seq Profiles with Accompanying Analysis
Princeton Workshop
Youtube/RNA-Seq
NGS Leaders
COFACTOR genomics

References
Wang Z, Gerstein M, Snyder M (January 2009). RNA-Seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics 10 (1): 57–63. doi:10.1038/nrg2484. PMC 2949280. PMID 19015660.
Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Research 41. doi:10.1093/nar/gkt214. PMID 23558742.
http://bioinformatics.oxfordjournals.org/content/29/1/15.full
Cole Trapnell, Lior Pachter and Steven Salzberg (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25 (9): 1105–1111. doi:10.1093/bioinformatics/btp120. PMC 2672628. PMID 19289445.
Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold and Lior Pachter (2010). Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Nature Biotechnology 28 (5): 511–515. doi:10.1038/nbt.1621. PMC 3146043. PMID 20436464.
Zerbino DR, Birney E (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Research 18 (5): 821–829. doi:10.1101/gr.074492.107. PMC 2336801. PMID 18349386.

Reposted from: http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools
Category: NGS | 5796 views | 0 comments
Exposing fraudulent practices in RNA-seq sequencing services in China
Popularity 1 gaoshannankai 2013-6-15 01:47
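1. I see many companies advising clients to do paired-end sequencing when single-end is in fact enough (with somewhat higher depth). With paired-end sequencing it is hard to guarantee good data in both directions, and there are other problems too, so it has no clear advantage over single-end. Companies recommend paired-end partly because they can charge twice the price, and partly because they can pool the RNA-seq data with other DNA sequencing (which must be paired-end) in one lane to cut costs. Readers who have not seen my earlier posts should not misunderstand: the vast majority of current software does not make full use of the pairing information, so paired-end data ends up being processed as single-end. If you have experience in this area it is a different matter, but amateurs will not manage it.
2. It is best to use a strand-specific protocol. It raises the library preparation cost only slightly but yields more useful information.
3. Some companies persuade clients to first sequence a large, very deep dataset from a multi-source sample (for example pooled tissues such as leaf, root and stem), and only then sequence what the client actually asked for (say 3 leaf samples and 3 root samples), which is very wasteful. Sequencing just the 3 leaf and 3 root samples is enough to answer the client's question.
4. Analysis services are forcibly bundled. Most of the analysis is done by a few undergraduates going through the motions; many more specialized small companies now offer cheaper, higher-quality analysis.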
6666 views | 1 comment
More on several issues in RNA-seq transcriptome data analysis
Popularity 5 gaoshannankai 2013-5-31 04:11
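Last time I wrote "Some lessons from RNA-seq experiments, to help researchers in China avoid detours": http://blog.sciencenet.cn/blog-907017-688359.html
One of the most important points there was quality control and contamination removal. Before alignment or assembly, contaminating sequences such as adapters, viruses and rRNA must be removed. After alignment or assembly, you still need to check carefully the fraction of reads aligned and the expression correlation between experimental replicates.
In a Tni transcriptome (an insect) that I worked on recently, assembly produced 70,458 transcripts. Quality control later showed that something was off, so I BLASTed the data against the nt database and, to my surprise, 11,825 transcripts came from E. coli. That many can only be contamination and must be removed, otherwise it will strongly affect downstream analysis, normalization for example.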
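The decontamination step described above (BLAST the assembled transcripts against nt, then discard those whose best hit is E. coli) could be sketched as follows. This is a minimal sketch rather than the author's actual procedure. It assumes the BLAST results were written as three tab-separated columns, query ID, bit score and subject title, for example with `-outfmt "6 qseqid bitscore stitle"`; the function names, transcript IDs and hit titles are invented for illustration.

```python
def best_hits(blast_tsv):
    """Keep the highest-scoring hit per query.

    Expects tab-separated lines: query ID, bit score, subject title
    (an assumed layout, see the note above).
    Returns {query_id: (bitscore, subject_title)}.
    """
    best = {}
    for line in blast_tsv.splitlines():
        if not line.strip():
            continue
        qseqid, bitscore, stitle = line.split("\t", 2)
        score = float(bitscore)
        if qseqid not in best or score > best[qseqid][0]:
            best[qseqid] = (score, stitle)
    return best


def contaminated_ids(blast_tsv, keywords=("Escherichia coli",)):
    """IDs of transcripts whose best hit title contains a contaminant keyword."""
    flagged = set()
    for qseqid, (_, title) in best_hits(blast_tsv).items():
        if any(k.lower() in title.lower() for k in keywords):
            flagged.add(qseqid)
    return flagged


def filter_transcripts(records, drop):
    """Return the (name, sequence) records whose name is not in `drop`."""
    return [(name, seq) for name, seq in records if name not in drop]


# Invented example data, for illustration only.
SAMPLE_HITS = (
    "t1\t950\tEscherichia coli str. K-12, complete genome\n"
    "t1\t100\thypothetical insect mRNA\n"
    "t2\t800\thypothetical insect actin mRNA\n"
    "t3\t50\tEscherichia coli plasmid\n"
    "t3\t700\thypothetical insect ribosomal protein mRNA\n"
)
TRANSCRIPTS = [("t1", "ACGT"), ("t2", "GGCC"), ("t3", "TTAA")]

FLAGGED = contaminated_ids(SAMPLE_HITS)
CLEAN = filter_transcripts(TRANSCRIPTS, FLAGGED)
```

Deciding by best hit only is a deliberate simplification: transcript t3 has a weak E. coli hit but a much stronger insect hit, so it is kept. A real pipeline would probably also apply e-value and coverage thresholds.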
6593 views | 8 comments
Some lessons from RNA-seq experiments, to help researchers in China avoid detours
Popularity 7 gaoshannankai 2013-5-10 03:39
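As Illumina has released cheaper instruments such as the MiSeq, RNA-seq is being used more and more in China. However, because of monopoly and other factors, many companies mislead their clients. Based on my experience with multiple projects and large amounts of data processing, here are a few suggestions, for reference only.
1. Stick firmly with Illumina. The other platforms have improved a lot, but the gap is still considerable.
2. 100 bp single-end sequencing is enough. It is better not to use paired-end, which is often used to overcharge clients.
3. For assembly, Trinity is still the best; the assembler from BGI is faster.
4. Before assembly, apply strict quality control to the sequencing data and remove contamination (primers, viruses, rRNA).
5. Strand-specific protocols are recommended. RNA-seq library preparation has to reverse-transcribe the RNA into cDNA, so the orientation of the transcript is lost. Strand-specific methods were invented to tell which DNA strand a transcript came from. There are many ways to prepare strand-specific RNA-seq libraries, but dUTP is still the most mature and reliable.
Our lab has a protocol that is both fast and cheap, which you may want to consult: High-Throughput Illumina Strand-Specific RNA Sequencing Library Preparation, Silin Zhong et al.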
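To make the strand-specific point concrete, here is a small sketch of how the transcript strand can be inferred from an aligned read in a dUTP library. It relies on the commonly documented behaviour of the dUTP method, where read 1 aligns antisense to the original RNA and read 2 aligns in the sense orientation; the function name is hypothetical, and the mapping should be checked against the library kit actually used.

```python
def transcript_strand(read_strand, is_read1=True):
    """Infer the strand of the originating transcript for a dUTP library.

    read_strand: '+' or '-', the genomic strand the read aligned to.
    is_read1:    True for single-end reads or the first mate of a pair.

    Assumes dUTP ("first-strand") chemistry: read 1 is antisense to the
    transcript, read 2 is sense.
    """
    if read_strand not in ("+", "-"):
        raise ValueError("read_strand must be '+' or '-'")
    opposite = {"+": "-", "-": "+"}
    return opposite[read_strand] if is_read1 else read_strand
```

For example, a single-end read aligned to the `+` strand would be attributed to a transcript on the `-` strand. With a non-stranded library no such rule exists, which is the information loss the post describes.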
11950 views | 19 comments
[Repost] A discussion of RNA-seq transcript assembly and reconstruction
bioseq 2012-9-14 11:07
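Sequencing RNA has long been regarded as an effective way to discover genes, and it is considered the gold standard for annotating both coding and noncoding genes. Compared with earlier methods, massively parallel sequencing of RNA has greatly increased throughput, making it possible to sequence whole transcriptomes. The two RNA-Seq methods discussed here analyze the transcriptome with unprecedented precision. The Trapnell group uses software called Cufflinks, which discovers novel transcripts in mouse myoblast cells and monitors transcript expression levels during differentiation, so that both gene expression and splicing can be analyzed. The Guttman group takes a similar approach with different software, Scripture. Scripture was used to reannotate the transcriptomes of three mouse cell lines, producing complete gene models for hundreds of recently discovered lincRNAs (large intergenic noncoding RNAs).

Although RNA sequencing has been around for nearly 20 years, until recently it required the construction of clone libraries. Projects to build full-length gene clones for human, mouse and other important model organisms took years to complete. With the latest sequencing technology, clone libraries are no longer needed and cDNA fragments can be sequenced directly. A reasonably complete transcriptome of a cell can now be obtained in a few days, at a small fraction of the cost of earlier projects. The new technology has a drawback, however: without clones, we can no longer tell which sequenced product came from which transcript. Recent studies have analyzed gene expression and alternative splicing using short reads assigned to known or predicted transcripts. Informative as these studies are, the approach is limited to known genes and known alternative splice junctions. To make full use of RNA-Seq data in biology, we should be able to reconstruct transcripts and measure their relative abundance accurately without relying on a reference annotation.

Two main strategies have been used to reconstruct transcripts from short RNA reads (Figure 1). The first is de novo assembly, for example with ABySS; the assembled transcripts can then be aligned in the same way as full-length cDNA sequences, which takes care of annotation. This approach can also find transcripts that are missing or incomplete in the reference genome, and transcripts in species that lack a reference genome. However, assembling transcripts de novo from short reads is very difficult in practice, and only highly abundant transcripts are likely to be assembled successfully.

Figure 1. Two approaches to reconstructing transcripts from RNA-Seq data. (Panel labels: RNA-Seq reads; Align reads to genome; Genome; Assemble transcripts de novo; Assemble transcripts from spliced alignments; Align transcripts to genome; More abundant; Less abundant.) Left: the align-then-assemble approach used by the Trapnell and Guttman groups. Short reads are first aligned to the genome, allowing for all possible splicing events, and transcripts are then reconstructed from the spliced alignments. Right: the assemble-then-align approach. Transcripts are first assembled de novo directly from the reads, and the assembled transcripts are then splice-aligned to the genome to delineate intron and exon structure and the differences between alternatively spliced forms. Because de novo assembly mostly works only for highly abundant transcripts, the align-then-assemble strategy on the left should be more sensitive, although this remains to be demonstrated. Each read is colored according to its transcript of origin. Protein-coding regions of the reconstructed transcripts are shown in dark color.

The second strategy aligns each short read to the genome first and then reconstructs the transcripts; this is the strategy used by the Trapnell and Guttman groups. Both groups used the TopHat aligner, and aligning to the genome yielded a large number of spliced alignments. Early RNA-Seq produced reads of only 25 to 32 bases. Reads of 75 bases or longer are now available, which makes alignment easier: the ends of a read can be anchored in different exons to determine the correct splice junction, without relying on prior annotation. Both methods end up with graphs of candidate transcripts, and paired-end information is then used to rule out unlikely paths and arrive at the final transcripts.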
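As a small illustration of the align-then-assemble strategy, the sketch below pulls candidate intron coordinates out of spliced alignments and counts how many reads support each one. It is not taken from TopHat, Cufflinks or Scripture. It relies only on the SAM convention that an `N` operation in a CIGAR string marks a skipped region of the reference (an intron); the function names and the example alignments are invented.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def splice_junctions(pos, cigar):
    """Return introns implied by one alignment as (start, end) pairs.

    pos is the 1-based leftmost reference position of the alignment;
    coordinates returned are 1-based and inclusive.
    """
    junctions = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            junctions.append((ref, ref + n - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += n
    return junctions


def junction_support(alignments):
    """Count how many alignments support each intron."""
    counts = Counter()
    for pos, cigar in alignments:
        counts.update(splice_junctions(pos, cigar))
    return counts


# Invented alignments: (leftmost position, CIGAR string).
ALIGNMENTS = [
    (100, "30M500N45M"),
    (110, "20M500N55M"),
    (1, "10S20M100N20M2I10M50N5M"),
    (700, "75M"),
]
SUPPORT = junction_support(ALIGNMENTS)
```

Junctions supported by several independent reads, such as (130, 629) here, are the kind of evidence from which a transcript graph can then be built; the assemblers discussed in the article add their own statistical models on top of this.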
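The algorithms also differ considerably. Cufflinks uses a rigorous mathematical model to identify all alternatively regulated transcripts at each locus and to estimate the abundance of each transcript. Scripture uses a statistical segmentation model to distinguish expressed loci from experimental noise. Cufflinks, Scripture and de novo assembly with ABySS will all need large-scale testing before we can say which method works best under which conditions.

Surprisingly, although the mouse genome has already been annotated in detail using millions of ESTs and thousands of full-length cDNA sequences, the Trapnell and Guttman groups still found thousands of previously unknown transcripts, including new isoforms of known genes and transcripts of entirely new coding and noncoding genes.

The Trapnell group found 3,724 new, high-confidence isoforms of known genes that were absent from both manually curated and automatically annotated gene databases. They also showed that discovering transcripts within each individual expression experiment has a substantial effect on downstream analysis. Earlier work had shown that RNA-seq accurately reflects gene expression over a wide dynamic range, but those experiments could only be interpreted in terms of known or predicted isoforms. By reconstructing all isoforms directly from the reads and then assigning each paired-end fragment to its isoform of origin, the Trapnell group could estimate the expression level of every isoform of every gene very accurately. They also found that assigning fragments correctly to transcripts can greatly change the estimated expression levels of the other known isoforms of the same gene.

Being able to measure the expression of each isoform makes it possible to study the regulation of gene expression in more depth. Regulation may act at the transcriptional level, for example by producing isoforms with different transcription start sites, or at the post-transcriptional level, for example by producing isoforms that share a transcription start site but differ in internal splicing. The Trapnell group found that, over the course of the experiment, the expression of a large number of genes changed markedly through both mechanisms. The ability to follow regulation across the whole genome over such a time course tells us more about how the genome functions. Data at this level of detail support better models of genomic regulatory networks, in which regulatory parameters can be adjusted according to the relationship between the splicing and the expression of each gene's isoforms, rather than being set directly gene by gene.

The Guttman group also found many new alternatively spliced isoforms, but they concentrated on newly discovered transcripts, lincRNAs in particular. Loci encoding lincRNAs had previously been identified by chromatin immunoprecipitation sequencing (ChIP-Seq) and by whole-genome tiling arrays, but the resolution of those methods was too low to build accurate gene models. Using Scripture, the Guttman group built gene models for 609 known loci, discovered more than 1,000 new lincRNAs and resolved their structures. They also found antisense transcripts for 469 protein-coding genes.

Building gene models for these noncoding RNAs makes functional studies easier. The Guttman group, for example, examined how well each class of transcript is conserved. Consistent with earlier observations, lincRNAs are much more conserved than intronic sequence but less conserved than protein-coding sequence. By contrast, antisense transcripts showed no higher conservation than protein-coding exon regions, which suggests that the two classes of transcript have different functions. RNA-seq data also reveal the expression patterns of noncoding transcripts: lincRNAs are expressed at lower levels than protein-coding RNAs, and their expression is markedly more tissue-specific. In short, a more detailed picture of noncoding RNA expression, together with more accurate gene models, will clarify the roles of these RNAs in gene regulatory networks and gene interaction models, and so deepen our understanding of their functions.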
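The isoform-level quantification described in this section can be illustrated with a toy expectation-maximization (EM) sketch. It is a simplified stand-in under stated assumptions, not the statistical model used by Cufflinks: each fragment is represented only by the set of isoforms it is compatible with, every fragment has at least one compatible isoform, and transcript lengths are ignored. The function name and fragment counts are hypothetical.

```python
def estimate_isoform_fractions(compat, n_isoforms, n_iter=200):
    """Estimate the fraction of fragments originating from each isoform.

    compat: list of tuples; each tuple holds the indices of the isoforms
            a fragment is compatible with (assumed non-empty).
    """
    theta = [1.0 / n_isoforms] * n_isoforms  # start from a uniform guess
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each fragment among its compatible isoforms
        # in proportion to the current abundance estimates.
        for isoforms in compat:
            total = sum(theta[i] for i in isoforms)
            for i in isoforms:
                counts[i] += theta[i] / total
        # M-step: renormalize the expected counts into fractions.
        n = sum(counts)
        theta = [c / n for c in counts]
    return theta


# Annotation with two known isoforms, A (index 0) and B (index 1).
known_only = [(0,)] * 30 + [(0, 1)] * 60 + [(1,)] * 10
known = estimate_isoform_fractions(known_only, 2)

# Same locus after a novel isoform C (index 2) is reconstructed:
# the shared fragments are now also compatible with C, and 20
# fragments that matched no known isoform are assigned to C.
with_c = [(0,)] * 30 + [(0, 1, 2)] * 60 + [(1,)] * 10 + [(2,)] * 20
with_novel = estimate_isoform_fractions(with_c, 3)

print([round(x, 3) for x in known])       # fractions for A and B
print([round(x, 3) for x in with_novel])  # fractions for A, B and C
```

In this toy case the estimated share of isoform A falls from 0.75 to 0.5 once isoform C is included, which mirrors the observation that adding newly reconstructed isoforms changes the estimated expression of known isoforms of the same gene.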
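That the Trapnell and Guttman groups found so many new transcripts raises an obvious question: why has annotation lagged so far behind? In the Trapnell group's experiment, known isoforms accounted for more than 80% of the total, which indicates that the known transcripts come from highly expressed genes and were therefore easy to find by conventional cDNA clone sequencing. The Guttman group's results were much the same. A further 11% of the RNA fragments came from newly discovered isoforms of known genes; 62% of these were supported by existing ESTs or mRNAs but had never been annotated as separate transcripts. Such low-abundance isoforms may have been seen sporadically in earlier studies and left unannotated because they resembled known transcripts or were not sequenced completely. Similarly, 43% of the lincRNAs found by the Guttman group can be traced in earlier mouse cDNA studies. Because lincRNAs are strongly tissue-specific and earlier work covered only certain tissues, the remaining 57% are probably transcripts that had not been found before. Early large-scale RNA sequencing efforts concentrated on protein-coding regions, which may be another reason annotation has fallen behind. The RNA-seq approach used by the Trapnell and Guttman groups can distinguish coding from noncoding transcripts unambiguously.

Cufflinks, Scripture and similar programs can greatly improve genome annotation, both for genomes that have been studied in great detail and for genomes that lack EST and full-length mRNA data. Annotation by RNA-seq is not flawless, however. Many of the transcripts found by Cufflinks and Scripture are known transcripts that were reconstructed only partially because of low coverage. Just as transcripts reconstructed from RNA-seq are often hard to match to EST data, many transcripts that are expressed at low levels or only in specific tissues are hard to detect with current RNA-seq methods.

As sequencing technology improves, transcriptomes can be sequenced more deeply, and more transcripts can be found more reliably. We still need better methods to separate low-abundance functional transcripts from background noise and artifacts. Cufflinks and Scripture are both very good annotation tools, but genomes differ in gene density, intron length and content, the frequency of alternative splicing and other properties, so different programs or algorithms will still be needed to suit different genomes. It also remains to be seen how Cufflinks and Scripture perform on genomes very different from the mouse genome.

Massively parallel sequencing has transformed the way we study genomes. The quality of sequencing results keeps improving, and the volume of information is growing explosively. The work discussed here shows how much RNA-seq and transcript discovery matter for genome annotation and for studying regulation at the transcriptional and post-transcriptional levels. If software keeps pace, RNA-seq will achieve even more.

Original article: Brian J Haas, Michael C Zody. (2010) Advancing RNA-Seq analysis. Nature Biotechnology, 28(5): 421-423.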
