AutoCirc从RNA-seq数据中自动鉴定潜在circRNA的反向剪接位点。核心算法是C++实现的,用Perl包装,在相同计算环境下,比其他circRNA检测算法更快。AutoCirc用来检测基因组已知的所有物种中的circRNAs。它不需要基因注释,但是基因注释会改善circRNAs的检测。 下载:https://github.com/chanzhou/AutoCirc 语言:C++/Perl 时间:20170829 参考: Genome-Wide Maps of m6A circRNAs Identify Widespread and Cell-Type-Specific Methylation Patterns that Are Distinct from mRNAs
salmon软件于2017年发表在Nature Methods,论文题目是“ Salmon provides fast and bias-aware quantification of transcript expression ” 。不到两年的时间被引250多次。 根据文中的描述,salmon要优于kallisto和express软件。 今天我们结合小麦最新的IWGSCv1.1版本的基因集,介绍下其定量流程,与kallisto进行一个定量结果的比较,以及看一看基因拷贝数变化时TPM如何变化。 首选要对转录本进行index salmon index --keepDuplicates -t IWGSC_v1.1_HC_LC_20170706_transcripts.fasta -i IWGSC_v1.1_HC_LC 下载测试数据,该测试数据来自中国春的叶片。 axel -a -n 12 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR947/SRR947005/SRR947005_1.fastq.gz axel -a -n 12 ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR947/SRR947005/SRR947005_2.fastq.gz 进行定量 salmon quant -i IWGSC_v1.1_HC_LC -l A -1 SRR947005_1.fastq.gz -2 SRR947005_2.fastq.gz -p 8 -o SRR947005_quant -i 指要使用的转录组index 这里是我们第一步生成的IWGSCv1.1版本的转录组index; -l 这里需要指定测序模式,这里-A是指自动判定。 -1 -2 分别要输入左端测序数据和右端测序数据 -p 则是要指定运行程序的cpu线程数 上述定量命令使用测序的原始reads进行的,这里还有另外一种基于比对的方式,即输入文件是BAM或SAM文件。 输出的结果文件如下所示,其中quant.sf里有每个基因的表达量。 salmon定量输出文件 kallisto定量过程 首先也是要对转录组进行index kallisto index -i IWGSC_v1.1_HC_LC_kallisto IWGSC_v1.1_HC_LC_20170706_transcripts.fasta 其次是定量过程。 kallisto quant -i IWGSC_v1.1_HC_LC_kallisto -t 10 -o SRR947005_kallisto SRR947005_1.fastq.gz SRR947005_2.fastq.gz 同样的结果文件如下图所示,其中abundance.tsv里有每个基因的表达量。 kallisto定量结果输出文件 salmon文章比较的内容,这里不再比较。下面我们做一个比较初步的分析。 首先看TPM大于0, 0.5, 1, 10的基因数目。注意两者转录本的总数是一致的。 TPM 0 0.5 1 10 Salmon 109998 58690 43160 7172 kallisto 116087 60805 44721 7443 kallisto-Salmon 6089 2115 1561 271 从基因数量上来看,kallisto有更多的基因有TPM值,而且相同转录本获得的TPM值,kallisto获得的往往要大于salmon获得的。但这也不是绝对的,如下图。以转录本TraesCS3B02G080100.1为例,这里使用salmon获得的TPM值是338.82,而使用kallisto获得的是168.69。那么针对这一个基因,哪个更准确一点呢? image-20180902220909122 而查阅信息可知,主要是两者的 effecting length 不同导致,而count数量很相似。 Name Length EffectiveLength TPM(salmon) NumReads target_id length eff_length est_counts tpm(kallisto) TraesCS3B02G080100.1 171 35.605 338.816942 531.769 TraesCS3B02G080100.1 171 78.4896 532.115 168.686 This is the computed effective length of the target transcript. It takes into account all factors being modeled that will effect the probability of sampling fragments from this transcript, including the fragment length distribution and sequence-specific and gc-fragment bias (if they are being modeled). 而kallisto里的定义为: “Effective length” is a scaling of transcript length by the fragment length distribution 具体的可参见“https://haroldpimentel.wordpress.com/2014/05/08/what-the-fpkm-a-review-rna-seq-expression-units/”。 查过一些资料,越看越糊涂。不过,如果使用同一软件处理所有的数据应该是没有问题的。 因为IWGSC官方也有这个数据集的count和TPM值,我去查可一下,竟然和我的不一样。 transcript COUNT \0 COUNT TPM TraesCS3B02G080100.1 294.42 96.0559 而我这边给的count是532.115,tpm是168.686。事情变得复杂起来。我细想了原因,可能有两方面的原因。其一,我们使用的软件版本不同,其二我们使用的数据SRR947005,他们可能经过一些预处理。而我直接下下来使用,没有经过预处理。 关于软件版本,他们用的是0.42.3,而我用的是kallisto 0.44.0 。所以,接下来我们就要排除软件版本的差异。 To calculate the read counts and TPMs we used Kallisto version 0.42.3 . The index was built with the default values and K=31. Kallisto was run with the default parameters. All the fastq files from a single sample are run together. The reference used was RefSeqv1.1, including the High and Low quality genes (IWGSC_v1.1_ALL_20170706_transcripts.fasta) 结果有点让我接受不了。即如果我使用0.42.3版本,得到的count和TPM就和IWGSC上下载的一样了。但显然与我0.44.0的版本不一致。所以中间出现了什么变故? kallisto target_id length eff_length est_counts tpm(kallisto) 0.44.0 TraesCS3B02G080100.1 171 78.4896 532.115 168.686 0.42.3 TraesCS3B02G080100.1 171 79.8953 294.42 96.0563 有什么变故我们不得而知,但显然让我有点接受不了,我的结果里有多少这样任性的基因?做出来的差异基因有多少又是可信的?下面是作者的回答。软件作者建议使用最新版本。 image-20180903155107203 作者都这样说了,加上最新版本的count也与salmon一致,所以,这让本打算使用850 RNA-seq 样本做些分析的我有点迟疑。 第二部分,探究基因拷贝数对定量结果的影响 尽管最近science上转录组那篇文章也有相似的讨论,不过我还是想眼见为实。这个测试也非常简单,那就是人为增加某个基因的拷贝数,观察对这个基因本身的影响以及另外另个同源基因的影响。这里选取TraesCS3D02G074400.1 ,TraesCS3B02G080100.1 和TraesCS1D02G227700.1 。TraesCS3D02G074400.1是孤家寡人。而TraesCS3B02G080100.1则有多个同源基因,TraesCS3B02G080700.1, TraesCS3D02G067400.1, TraesCS3A02G067400.1, TraesCS3B02G080600.1, TraesCS3D02G067100.1, TraesCS7B02G443100.1, TraesCS3D02G067200.1, TraesCS3D02G219800.1, TraesCS3A02G233600.1 ;而TraesCS1D02G227700.1对应的有TraesCS1B02G240400.1,TraesCS1A02G226300.1,同时该基因还有可变剪切,TraesCS1D02G227700.2,TraesCS1B02G240400.2,TraesCS1A02G226300.2。 所以我们比较的重点就放在这几个基因上,看这几个基因的变化。 kallisto index -i IWGSC_v1 .1_HC_LC_kallisto_add IWGSC_v1 .1_HC_LC_add.fasta kallisto quant -i IWGSC_v1 .1_HC_LC_kallisto_add -t 10 -o SRR947005_kallisto_add SRR947005_1 .fastq.gz SRR947005_2 .fastq.gz 首先对这三个基因来说,其表达量减半。因为我们在序列上一字未改的复制了这3个序列。所以从结果看,复制的基因分享了一半的reads。 target_id length eff_length est_counts tpm(kallisto) add|TraesCS3B02G080100.1 171 79.2217 266.054 83.6109 TraesCS3B02G080100.1 171 79.2217 266.054 83.6109 add|TraesCS3D02G074400.1 752 561.072 4175.4 185.275 TraesCS3D02G074400.1 752 561.072 4175.4 185.275 add|TraesCS1D02G227700.1 1867 1675.73 284.687 4.22962 TraesCS1D02G227700.1 1867 1675.73 284.687 4.22962 同样的我们再比较多拷贝的情况。多拷贝的我们看这个基因TraesCS3B02G080100。复制之后,TraesCS3B02G080100的表达减半,其他拷贝几乎不变。 那么是否对可变剪切的转录本有影响呢?结果和上面一样,只是TraesCS1D02G227700.1自身TPM值减半。 既然拷贝数增加,只改变其自身的TPM值。那么同样的道理,如果减少一个拷贝数,其他拷贝数的TPM也不会发生显著变化。当然这里是有一个前提,那就是不同拷贝之间序列不是100%一致的。 刚才的复制我们一字未改,现在我们改变其中一个碱基,我们再观察结果如何。统一在这3个转录本序列的第80位改变一个碱基。理论上,改变之后这3个转录本的TPM值为0。 Kallisto的结果如下 target_id length eff_length est_counts tpm(kallisto) 1snp|TraesCS3D02G074400.1 752 561.072 0 0 1snp|TraesCS3B02G080100.1 171 79.2217 0 0 1snp|TraesCS1D02G227700.1 1867 1675.73 0.698188 0.010373 TraesCS1D02G227700.1 1867 1675.73 568.676 8.44886 TraesCS3B02G080100.1 171 79.2217 532.107 167.222 TraesCS3D02G074400.1 752 561.072 8350.8 370.551 如上表所示,实际情况也差不多如此,只有1snp|TraesCS1D02G227700.1有很小的TPM。 在改一个SNP的情况下,salmon表现似乎更好,修改的全部是0,而原来的TPM值也不变。 Name Length EffectiveLength TPM NumReads 1snp|TraesCS3D02G074400.1 752 550.209 0 0 1snp|TraesCS3B02G080100.1 171 35.447 0 0 1snp|TraesCS1D02G227700.1 1867 1665.119 0 0 TraesCS1D02G227700.1 1867 1665.119 7.729668 567.493 TraesCS3B02G080100.1 171 35.447 340.231283 531.757 TraesCS3D02G074400.1 752 550.209 343.102088 8323.484 所以总结来说,在同一个项目中,使用的软件版本要一致。至于哪一个salmon和kallisto哪个更好,我个人可能更偏向salmon,但kallisto毕竟是science使用的,所以做小麦的话,kallisto应该是公认的。kallisto算出来的差异基因数量要多于salmon。 但同时也要注意IWGSC注释的转录本的正确率。特别是对于低可信度的基因,基因结构不正确,做出来的TPM可能也不对。而且IWGSC注释的转录本其3’和5’的非翻译区有不少缺失。转录组里还可以组装出新的转录本或者新的可变剪切。 目前IWGSC给出的转录本是26万左右,而其中高可信的基因只有10万。我们进行差异表达分析时使用的是26个转录本,但只有那10万个高可信基因的功能注释较为可信。所以后续的功能富集分析,就要慎重。差异表达可以包括低可信的基因,但功能富集分析时可以不包括在里面,特别是不能用来充当总数,science上的那篇文章也正是这样做的。另外,功能富集分析时使用的背景基因应该是在某条件下表达的基因集,而不是总基因集。比如,总共有10个基因,其中只有5个基因表达,而这5个表达的基因里有2个差异表达,此时使用的总数是5而不是10。当然,如果差异基因里包括低可信的基因,那表达的总基因里也可以包括低可信度的基因。不过,我还是建议使用表达的高可信度的基因作为整体。 昨天我在推送里表达了在基因富集分析中背景基因如何选择的观点?我的观点是需要使用表达的基因,确切的说是使用表达的并且有功能注释的基因作为背景。比如,做GO分析,背景基因都得有GO注释信息,那些没有GO注释信息的显然不宜拿来当作背景。同样的拿来做GO分析的差异表达基因也得都GO注释信息,那些差异表达但没有GO注释信息的基因就不必包括在里边。为啥?道理很简单,我要做的是GO信息的富集分析,那些没有GO注释信息的基因即使放在里边也没毛用。 这是我的观点,也许不对。如果有大神跳出来吊打,我还是欢迎的。群里也有小伙伴发表自己的看法,看过之后很受启发。接下来,且听我细细道来。 这里所说的表达的基因,是指只要在分析所用样本中的一个里有表达(如TPM0.5就认为表达),比如在叶片和根里,只要基因在这两个组织中的任何一个里表达,我们就认为基因是表达的。即取的是不同样本间的并集。这里只是拿叶片和根说事,如果在文章里有人这样做,你可以将这篇文章扔到垃圾桶了。 同时我在google上也查了一些资料。在biostars(https://www.biostars.org/p/17628/)也有很多人在讨论。 基本的结论是,使用表达的基因作为背景是合适的,没有人明确反对这一做法不妥。 根据上面的回答,找到这篇文献“Multiple sources of bias confound functional enrichment analysis of global -omics data”。该文2015年发表在genome biology上。 文中的结论是: The generation of an estimated background ‘universe’ in RNA-seq data could be achieved by removing zero-count genes , but the nature of this ‘universe’ will still depend on many factors. This list should contain only factors (RNA or protein) that are both robustly ‘probed or sequenced’ (to avoid technology and detection bias) and ‘called’ as expressed (to avoid biological bias) in the experiment. After all, we do not want to carry out functional enrichment analysis of a specific tissue simply to be informed that we are studying that tissue! In conclusion, functional enrichment analysis must not be considered proof of biological plausibility or validity in the analysis of high-throughput -omics data. We strongly advocate for efforts to generate appropriate background expression ‘universes’. We also urge that background gene lists are provided for any functional enrichment analysis, and that a higher statistical threshold is used as a default, given the scale of the pre-existing biases, to avoid marginal (e.g., 1 × 10^−3^ or 1 × ^10−4^) enrichments being relied on to drive the interpretation of an experiment. 做差异基因功能富集分析的一定要看看本文。里边讨论了各种有可能导致最终结果不准确的情况。 一些在线进行GO富集分析的网站,如果不能自己背景序列,那就是耍流氓。 However, somewhat surprisingly, the Gene Ontology consortium website analysis tool does not allow for this option, making any analysis thoroughly unreliable. 实际上任何实验手段都不是完美的。我们现在做RNA-seq比较方便,RNA-seq并不是万能呢,蛋白水平就要比转录水平要好。还有组织毕竟是混杂,那单细胞结果又如何呢?代谢组呢? 通过差异表达基因功能富集结果我们可以说参与了某些通路,或者在哪些通路上富集。但最好不要说与哪些通路无关,但可以说这些通路上的基因没有富集。
本文转载自嘉因微信公众号,已获得授权。查看最新文章,敬请关注嘉因,微信ID:rainbow-genome 作者:小丫 来源: 嘉因 大家都会做方便面,有人做辛拉面,有人做三鲜伊面,工艺有何不同? 大家都会做RNA-seq,有人能筛出有意义的基因,有人能找出有价值的线索,有人。。。差别在哪? 前六期介绍了通过去除核糖体rRNA建库的数据处理、参考基因组的选择、均一化处理、差异基因筛选、画heatmap和富集分析的合理方法: 第六期:去除rRNA: 同样用Ensembl算TPM,结果还是不一样? | Ensembl的注意事项 第五期:gene model : 同样算read counts,为什么公司跟师兄算的结果不一样?| Ensembl、UCSC、Refseq,该用哪个 第一期:数据预处理 : 同一套RNA-seq,为什么公司做的跟师兄跑的结果不一样? | TPM、read counts、RPKM/FPKM你选对了吗? 第二期:差异基因筛选 : 同一套RNA-seq,公司筛出的差异基因跟师兄筛出的为什么不一样?| Pvalue, FDR, cutoff 第三期:heatmap : heatmap画不好会得出错误结论 | 数据预处理、聚类分析,HCL、 K means里的讲究 第四期:富集分析 : 富集分析,俩人做的结果差5岁 | 你用的注释文件有多老? 在 选哪个lncRNA的哪段设计引物?RT-qPCR验证,我要百发百中! 一文中,小丫谈到 用链特异性建库方式做RNA-seq,能够明确的看出read来自哪条链 ,即:来自蛋白编码基因,亦或是来自antisense上转录出来的lncRNA;而普通RT-qPCR无法区分来自不同链的RNA。 做了链特异性建库测序,却不懂得如何看数据,就浪费了宝贵的链特异信息。 嘉因生物做链特异lncRNA-seq,为您提供来自Foward和Reverse的2个bigwig文件,清楚直观的查看蛋白编码基因和antisense上lncRNA的表达情况。 了解去除rRNA和链特异性的RNA-seq,先看 【陈巍学基因】视频15:RiboZero和方向性RNA文库 下面转载一篇生信技能树的帖子: 链特异建库那点事 ,把链特异性建库的问题都掰扯明白了。大长文,列了个提纲。总结几条笔记,免得现在看明白了,过两天又不明白了。 1. RNA-seq基本流程。 反转录用的随机引物并不均匀导致fastQC结果的前6个碱基GC含量分布不均匀。 为什么加接头? 为什么要双端测序? 2. 链特异性测序。原理、流程 3. 正正反反不清楚。 DNA整体概念: 正链:forward; 负(反)链:reverse 局部基因概念: 正义链:sensestrand,携带编码蛋白质信息的链 反义链:antisensestrand,与正义链互补的链 4. dUTP到底是怎么回事 dUTP方式测序R1文件中read1的方向和基因的方向(正义链)是相反的,而R2文件中的read2方向和基因的方向是相同的。 mapping后的bam文件拖到IGV里,怎么看? Read strand模式:看同一片段上测到的两个reads: 蓝色:负链,位于正链的基因此read为R1,位于负链的基因此链为R2; 红色:正链,位于正链的基因此read为R2,位于负链的基因此链为R1; first of pair of strand模式:看片段来自哪条链,以此区分哪些片段来自蛋白编码基因,哪些片段来自antisense上的lncRNA: 蓝色:片段来自正链 红色:片段来自负链 5. 几个常用软件的设置 hisat2 --rna-strandness RF tophat --library-type option fr-firststrand htseq-count:–s reverse rsem:--forward-prob0 sXpress --rf-stranded / --fr-stranded trinity --SSlibtype RF 6. 参数错了又怎样? 链特异性当作普通建库数据: 具体某一个基因而言,影响不会太大,因为绝大多数反义链本身表达量就非常低。 普通建库当作链特异性数据处理: 会损失大量的数据,没剩下几个read。计算出来的结果自然也会有非常大的差异,是不准确的。 链特异建库那点事 原创 2017-10-25 思考问题的熊 关于链特异性测序的若干问题,很久以前就以为自己想清楚了,但是每次提起它的时候又容易重新产生各种各样的小困惑。于是整理一下,以免以后再时不时犯迷糊。很多东西就是这样,你以为的明白并不是真的明白,一年前的明白和一年后的明白也不是同一个明白。我这么说,不知道你能明白还是不明白。 RNA-seq基本流程 下图是一个大概的RNA-seq基本流程 把RNA破碎成小片段,然后将RNA转变成一条cDNA,这一步需要用到反转录酶 reverse transcriptase (RT) 才能用RNA作为模板合成DNA。 不论是转录还是反转录都需要引物。通常如果我们要mRNA,那就可以用oligo-dT作为RT的引物,但是用它有两个问题,第一个是只能反转录那些有A尾巴的RNA,第二个问题是RT不是一个高度持续性的聚合酶,可能让转录提前发生终止,造成的结果就是3'端要比5'端reads富集,这样就会使得后续定量分析带来bias。 另一种常用的引物称为 随机引物 ,随机引物的好处是没有A尾巴的诸如ncRNA也被留下了,而且不会存在明显的3'端偏差。但是很多研究也发现,所谓的随机引物根本就不随机, 这也是测序结果中,通常前6个碱基的GC含量分布特别不均匀的原因 。这几个碱基GC含量均匀很可能不是接头或者barcode那些东西,其实是Illumina 测序RT这一步的random hexamer priming 造成的bias,很多人在处理数据的时候会把这几个碱基去掉,其实很多时候真多RNA-seq数据去不去掉基本什么影响,不过开头如果有低质量的碱基倒是应该去掉。 随后是第二条链合成,这一步用是DNA聚合酶,以刚才合成的第一条链作为模板。 接下来就是在序列两端加上接头,加接头一方面是为了让机器可以识别这些序列,把这些序列固定;二是为了让多个样品可以同时上机,平摊每个样品的测序价格。双端测序为了让read从两边开始延伸,也需要在两端有所需的引物。下表是常见接头种类。 所谓双端测序,因为很多时候read的长度要短于insert,为了增加覆盖度于是就想出了从insert两端同时测序的办法。使得测序深度增加的同时也能够用来判断isoform方向。 对于illumina数据,有一条5-3的universal adaptor;还有一条是3-5的indexed adatpor,这条引物含有特异常的barcode。需要说明的是,在双端测序中,如果insert 不是足够长,那么R1可能就会测到R2的引物,同时R2可能会测到R1引物的反向互补序列。 大概的意思就是下面两张图。 加了接头以后进行PCR的扩增。扩增后就开始测序,测序的过程如下图所示。 测序的基本思想是机器识别四种碱基发出的不同颜色的荧光,可以理解为一个flow cell 立着非常多序列,机器一层一层扫过去,通过识别荧光而判断这一层每个序列的碱基是什么。 因为一个cell密密麻麻的全是荧光信号,机器并不是总能把每一个判断的非常准确,如果某一个荧光信号没有那么清晰,这个碱基的测序质量就比较低,如下图。 有的时候,如果一大片点都是同一种荧光,机器也可能犯晕,不知道到底哪一个荧光属于哪一个序列。这种情况尤其是在序列的前几个碱基容易发生。 The sequencing machine uses the first few bases to establish where the cDNA fragments are on the flow cell. If all of the bases in one part of the flow cell are all the same, like 'C', and all show up green in the picture, then the colors will bleed together and it will not be clear where exactly all of the reads are. In contrast, if you have a lot of different colors in a region, it's easier to determine where each one is, even with a little color bleed. 链特异性测序 和普通的RNAseq不同,链特异性测序可以保留最初产生RNA的方向,普通建库方式为什么不行呢?因为传统建库方式通过两个接头的ligation把RNA已经变成了双链DNA,最后的文库中一部被测序的链对应正义链(sense strand),一部分被测序的链测是反义链。 链特异性建库方式有不止一种,对应到不同的软件又有不同的叫法,下面是几种称呼。 要记住的是dUTP 测序方式的名字是fr-firstrand,也是RF。 至于具体的read方向接下来通过更详细的IGV截图说明问题。 链特异性建库方式(以目前最常用的dUTP为例,如下图所示)首先利用随机引物合成RNA的一条cDNA链,在合成第二条链的时候用dUTP代替dTTP,加adaptor后用UDGase处理,将有U的第二条cDNA降解掉。 这样 最后的insert DNA fragment都是来自于第一条cDNA,也就是dUTP叫fr-firststrand的原因 。对于dUTP数据, tophat 的参数应该为 –library-type fr-firststrand 。这里的first-strand cDNA可不是RNA strand,在使用 htseq-count 时,真正的正义链应该是使用参数 -s reverse 得到的结果。 正正反反不清楚 说到链特异性测序,实在让人困惑的是 各种链的概念 ,尤其是翻译成中文就更说不清了。 DNA 的正链和负链,就是那两条反向互补的链。参考基因组给出的那个链就是所谓的正链(forward),另一条链是反链(reverse)。但是这正反一定 不能和正义链(sense strand)反义链(antisense strand)混淆 ,两条互补的DNA链其中一条携带编码蛋白质信息的链称为正义链,另一条与之互补的称为反义链。但是携带编码信息的正义链不是模板,只是因为它的序列和RNA相同,正义链也是编码链。而反义链虽然和RNA反向互补,但它可是真正给RNA当模板的链,因此反义链也是模板链。 总结两点 正义链(sense strand)= 编码链(coding strand)= 非模板链 forward strand 上可以同时有sense strand 和 antisense strand。因为这完全是两个不同的概念。 写这篇文章的原因,就是因为有人问我,链特异性测序数据 htseq-count 的结果是不是应该把正负链的基因分别在-s yes 和-s reverse 两个参数结果中统计出来再做下游分析。这里犯的错误就是我们混淆了基因组正反链和基因正义反义链的概念。 dUTP到底是怎么回事 从前文的一个图我们可以总结出 dUTP方式测序R1文件中read1 的方向和基因的方向(正义链)是相反的,而R2文件中的read2 方向和基因的方向是相同的。 可以参考下面的两个igv文件bam截图。 首先解释一下igv 两个颜色参数的意义 Read strand in pastels, red for positive rightward (5' to 3') DNA strand , blue for negative leftward (reverse-complement) DNA strand , and grey for unpaired mate, mate not mapped, or otherwise unknown status. First-of-pair strand assignment is dependent on RNA transcript directionality and is useful for directional libraries. Displays reads or read pairs in which the forward read is first (F1 or F1R2) in red and reads or read pairs in which the reverse read is first (R1 or R1F2) in blue. Unknown status is in gray. For a given transcript, non-directional libraries will show a mix of red and blue reads aligning to the locus. Directional libraries will show reads of one color in the direction matching the transcript orientation. 下面这个图示按照igv 颜色选项中的read strand 方向进行区分,可以看到所有 红色read都是在正链方向 (注意正链不是正义链),而所有 蓝色的read都是负链方向 。下面基因的方向是正链方向,也就是和红色的read同向的,如果你把鼠标放到随意一个红色的read上,就能看到显示的信息是 second of pair ,也就是 pair中的read2(R2) ;反之如果你在蓝色的read上面,就会显示信息是first of pair,也就是R1 。 总结,dUTP测序中pair read 中的read1(R1)和基因方向相反,read2(R2)和基因方向相同 再看下面这张图 这张图展示了两个基因1和2,我们可以发现 gene1的正义链就在正链上 ,而 gene2的正义链其实是在反链上 。看read情况,a,c两个read虽然针对正链负链而言方向一致,都是负链方向,但是如果把 a是pair中的read1(first of pair ) ,而 c是pair中的read2(second of pair) 。也就是说, read方向一致,但一个是read1一个是read2,说明这两个read对应的基因一定是反向的。 同样的道理,虽然 b,d都是两个方向为负链的read,但是b其实是所在pair的read2(second of pair),而d是所在pair的read1(first of pair)。 再次强调,dUTP测序中pair read 中的read1(R1)和基因方向相反,read2和基因方向相同 当使用read strand来进行颜色区分的时候,每一个基因上两种颜色的分布应该相对均匀(也就是所谓的pair end)。 如果这个时候把 颜色选项改为按照 first of pair of strand 来区分 ,会出现下图的变化。 gene1的read全部变成了蓝色,而gene2的read全部变成了红色。 如果是非链特异性测序,在 first of pair of strand 模式下,同一个gene相关的read颜色也是明显混杂的。如下图 再一次总结: dUTP 链特异性测序中,RNA 方向(gff文件中基因的方向)与read1相反,与read2相同。如果read1比对到基因组正链上,则对应的gene在基因组负链;如果read2比对到基因组正链则对应的gene在基因组正链。 dTUP 测序方式叫做fr-firstrand(留下的是cDNA第一条链),也是RF。 如果dUTP链特异性测序,看基因表达量应该 counts for the 2nd read strand aligned with RNA(htseq-count option -s reverse, STAR ReadsPerGene.out.tab column 3 ) 如果想看反义链是否有转录本(比如NAT)应该用 the 1st read strand aligned with RNA ( htseq-count option -s yes,STAR ReadsPerGene.out.tab column 4) 几个常用软件的设置 STAR mpping 时无需特别设置,但如果不是链特异性数据且下游分析要用到cufflinks 则需要增加参数 --outSAMstrandField intronMotif。为的是增加一个XS标签。 If you have stranded RNA-seq data, you do not need to use any specific STAR options. Instead, you need to run Cufflinks with the library option --library-type options. For example, cufflinks... --library-type fr-firststrand should be used for the standard dUTP protocol, including Illumina’s stranded Tru-Seq. hisat2 --rna-strandness RF 目的也是给比对序列添加一个XS标签以区分方向,方便cufflinks使用。 For single-end reads, use F or R. 'F' means a read corresponds to a transcript. 'R' means a read corresponds to the reverse complemented counterpart of a transcript. For paired-end reads, use either FR or RF. With this option being used, every read alignment will have an XS attribute tag: '+' means a read belongs to a transcript on '+' strand of genome. '-' means a read belongs to a transcript on '-' strand of genome. tophat --library-type option fr-firststrand 具体解释参见下表 htseq-count -s reverse/yes(看反义链) For stranded=no , a read is considered overlapping with a feature regardless of whether it is mapped to the same or the opposite strand as the feature. For stranded=yes and single-end reads, the read has to be mapped to the same strand as the feature. For paired-end reads, the first read has to be on the same strand and the second read on the opposite strand. For stranded=reverse , these rules are reversed. RSEM --forward-prob 0(正义链)1(看反义链) The RNA-Seq protocol used to generate the reads is strand specific, i.e., all (upstream) reads are derived from the forward strand . This option is equivalent to --forward-prob=1.0 . With this option set, if RSEM runs the Bowtie/Bowtie 2 aligner, the '--norc' Bowtie/Bowtie 2 option will be used, which disables alignment to the reverse strand of transcripts. (Default: off) Probability of generating a read from the forward strand of a transcript. Set to 1 for a strand-specific protocol where all (upstream) reads are derived from the forward strand, 0 for a strand-specific protocol where all (upstream) read are derived from the reverse strand , or 0.5 for a non-strand-specific protocol. (Default: 0.5) sXpress --rf-stranded / --fr-stranded(看反义链) --fr eXpress only accepts alignments (single-end or paired) where the first (or only) read is aligned to the forward target sequence and the second read is aligned to the reverse-complemented target sequence. In directional sequencing, this is equivalent to second-strand only. --rf eXpress only accepts alignments (single-end or paired) where the first (or only) read is aligned to the reverse-completemented target sequence and the second read is aligned to the forward target sequence. In directional sequencing, this is equivalent to first-strand only. trinity --SS lib type RF Trinity performs best with strand-specific data, in which case sense and antisense transcripts can be resolved. RF : first read (/1) of fragment pair is sequenced as anti-sense (reverse( R )), and second read (/2) is in the sense strand (forward( F )); typical of the dUTP/UDG sequencing method. 参数错了又怎样? 到这里,会想问两个问题。有时候我们不知道数据的建库方式是不是链特异性的,如果弄错了结果会怎么样呢? 如果你用STAR mapping 完可以用igv像上文提到的那样,看看是不是链特异性测序。 下面是两个真是数据的count 统计情况。 对于 仅仅进行基因表达定量 来说,把链特异性数据当作普通建库数据来处理,可以观察第2列数据和第4列数据。具体某一个基因而言,影响不会太大,因为绝大多数反义链本身表达量就非常低。 不过可以注意 noFeature 和 ambiguous 这两个值,因为基因组中存在两个基因分别在正链和负链且又重叠的情况,不区分方向会比区分方向的ambiguous数目多一些。因为如果不能通过方向来区分到底属于哪个基因,这样的read就会被认为是ambiguous。 但是因为区分了方向,又会使得noFeature的数目更多一些。不过两者总体影响不会差别非常大。如果不能判断建库方式,在htseq中使用-s no 参数是一个比较保险(虽然不是非常精确)的做法。 相反,如果把普通建库方式的数据当作链特异性数据来处理。 比如在htseq-count中使用了-s reverse 参数,这个时候 只有R2方向和基因方向相同的pair才用来算作一个count ,所有R2和基因方向不同的pair就被当作no feature了。这样的结果影响可以通过下面的表格观察。 用正常方法数出的noFeature 是6万左右,而用-s yes或者reverse数出来的noFeature 就接近46万了。将近40万的read 被丢掉。 所以,如果把普通建库的数据误当作链特异性数据来处理极有可能会损失大量的数据,如果 弄错了链特异性建库的方式 ,那坑你就没几个read剩下了。另外,计算出来的结果自然也会有非常大的差异,是不准确的。 作者: 思考问题的熊 几年前分享过一些考研经历和备考建议而被部分考研人所知,也分享过一些英语学习相关的内容而被调侃戏称为“靠谱学长”。 目前在中科院上海植物生理生态研究所硕博连读,努力在生物信息学菜鸟晋级的路上前进。 参考资料和更多文章可以移步作者博客 http://kaopubear.top
本文转载自嘉因微信公众号,已获得授权。查看最新文章,敬请关注嘉因,微信ID:rainbow-genome 作者:小哈 来源: 嘉因 大家都会做方便面,有人做辛拉面,有人做三鲜伊面,工艺有何不同? 大家都会做RNA-seq,有人能筛出有意义的基因,有人能找出有价值的线索,有人。。。差别在哪? 前四期介绍了数据均一化处理、差异基因筛选、画heatmap和富集分析的合理方法: 第一期:数据预处理: 同一套RNA-seq,为什么公司做的跟师兄跑的结果不一样? | TPM、read counts、RPKM/FPKM你选对了吗? 第二期:差异基因筛选: 同一套RNA-seq,公司筛出的差异基因跟师兄筛出的为什么不一样?| Pvalue, FDR, cutoff 第三期:heatmap: heatmap画不好会得出错误结论 | 数据预处理、聚类分析,HCL、 K means里的讲究 第四期:富集分析: 富集分析,俩人做的结果差5岁 | 你用的注释文件有多老? 小哈让我们算read counts, 可是, 为什么我算的read counts跟公司算的还是不一样 ?本期回过头来看mapping时选用的Gene model对结果的影响。 拿到测序数据,首先要把read回帖到基因组上,这时需要基因组序列fasta文件,还要告诉它基因组上哪个位置有基因,即gene model,保存在 gtf文件 里。 如果分析人或小鼠的数据,就用 GENCODE 。那么,著名的Ensembl、UCSC、Refseq,跟GENCODE是啥关系?其他物种用哪个呢? Ensembl说目前这个版本的 GENCODE = Ensembl ,www.ensembl.org/Help/Faq?id=303 只有GENCODE自己知道,它跟ensembl还是有些区别的,GTF文件稍有不同,www.gencodegenes.org/faq.html 点击查看清晰大图 Ensembl、UCSC、Refseq,选择不同, 对结果有多大影响 ? 有人专门做了对比。 先总体评价了三种gene model对mapping的影响; 然后举例细看对某些基因的具体影响; 先说结论: Gene model 会影响 基因表达量乃至差异表达基因的筛选,尤其是不同gene model对某些基因的长度、junction位点注释有出入; Ensembl的注释相对更加准确,基因更多; 推荐 人鼠用 GENCODE ,谁让它出自最权威的ENCODE呢,其他物种用 Ensembl 。 下面逐个查看文章里的结果: The read mapping summary in the “transcriptome only” and “transcriptome + genome” mapping modes: more reads are mapped in Ensembl than in RefGene and UCSC in the “transcriptome only” mode more reads become multiple-mapped in Ensembl than in RefGene and UCSC The RefGene and UCSC consistently have the highest percentage of uniquely mapped reads; while the percentage of non-uniquely mapped reads is much higher in Ensembl. Without a gene model (indicated in pink) in the mapping step, a constant 6% of reads become unmapped. Divided uniquely mapped reads into two classes, i.e., non-junction reads and junction reads , and investigated the impact of a gene model on their mapping. The impact of a gene model on mapping of non-junction reads is different from junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene models was used. By contrast , this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10– 15% mapped alternatively. The overlap and intersection among RefGene, UCSC, and Ensembl annotations In general, different annotations have very high overlaps: there are 21,598 common genes shared by all three gene models. RefGene has the fewest unique genes while more than 50% of genes in Ensembl are unique . The correlation of gene quantification results between RefGene and Ensembl Although the majority of genes have highly consistent or nearly identical expression levels, there are many genes whose quantification results are dramatically affected by the choice of a gene model 具体看每个基因的read counts,用Ensembl和RefGene算出来的read counts差好远,为什么呢?下面举例看2个基因的情况 The different gene definitions for PIK3CA give rise to differences in gene quantification PIK3CA in the Ensembl annotation is much longer than its definition in RefGene , explaining why there are 1094 reads mapped to PIK3CA in Ensembl , while only 492 reads are mapped in RefGene . The PIK3CA gene definition in Ensembl seems more accurate than the one in RefGene, based upon the mapping profile of sequence reads. The different gene definitions for LUZP6 . In the Ensembl annotation, LUZP6 is only 177 bp long, and it is completely within another gene, MTPN. As a result, all sequence reads originating from LUZP6 are assigned to MTPN instead. In RefGene , LUZP6 and MTPN are derived from the same genomic region, and both encode exactly the same mRNA, though the protein coding sequences are different. Therefore, all reads mapped to this region are equally distributed between these two genes. The correlation of the calculated Log2Ratio (heart/liver) between RefGene and Ensembl. Although the majority of genes have highly consistent expression changes, there are many genes that are remarkably affected by the choice of different gene models .
Trinity无比强大,但是在组装结果还是太零散,此外内存占用也是硬伤。如果你有一个土豪机器,能一次运行几百G的程序,那建议将所有样本的PE reads清理后R1,R2相应合到一起,再用Trinity 进行PE组装。没有强大机器的人们一般会将多个样本分开组装,再通过一些手段对contig再组装,最后生成所有样本共同的Unigene,得到所有样本统一的reference,来比对计算差异表达。 这里推荐一款老牌的聚类组装软件——TGICL ( TGI Clustering tools ),完成这个contig再组装的过程。它先使用mageblast对输入的fasta文件进行比对后聚类成cluster,再使用CAP3对每个cluster进行组装。此外因为聚类后各cluster间相互独立无交集,程序提供多线程的选项(来弥补CAP3这一步计算速度慢的缺陷)。 1. 下载安装 到https://sourceforge.net/projects/tgicl/files/?source=navbar下载最新版的TGICL (以最新版v2.1版为例)。建议下载TGICL-2.1.tar.gz原程序,貌似rpm/deb二进制的安装会有报错,具体没有深究,感兴趣的朋友可以使用rpm/deb格式的安装包进行安装,欢迎将过程分享到此博文下,谢谢。 tar -zxvf TGICL-2.1.tar.gz cd TGICL-2.1 perl Build.PL ./Build ./Build test ./Build install #如果哪一步有permission denied的提示,则需要在命令前加sudo 如果在终端中输入 perldoc -F /usr/local/bin/tgicl 出现TGICL的help页面(如下图)则表示安装完成! 2. TGICL的使用 以某处为工作目录,将需去冗余和组装的fasta序列(例如Trinity组装结果中大于500bp的序列文件trinity_gt500.fasta)放到工作目录下,运行 tgicl -F trinity_gt500.fasta -c 3 这时可能会出现如下提示,可以不用理会: Use of :locked is deprecated at /usr/local/share/perl/5.18.2/TGI/DBDrv.pm line 36. 运行完结果如下 err_tgicl_trinity_gt500.fasta.log和tgicl_trinity_gt500.fasta.log分别是标准错误输出和标准输出的记录文件,如果运行中无报错,两者应该是一样的。文件中记录了建立索引、聚类和组装三个过程的一些信息。 重点来了: Q1:工作目录下trinity_gt500.fasta, trinity_gt500.fasta_cl_clusters, trinity_gt500.fasta.singletons和masked.lst这四个文件之间是什么关系? 为了清晰起见,这里既作了后三个文件中序列的韦恩图,又作了四个文件中序列的韦恩图。 结果显示 A. trinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons文件中的序列居然有交集! B. trinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons的并集就是trinity_gt500.fasta中全体序列 C. masked.lst里的序列既有 trinity_gt500.fasta_cl_clusters里的,又有trinity_gt500.fasta.singletons里的序列 Q2:asm文件夹下的文件与cluster是什么关系? 因为-c指定用3个CPU,这里会生成3个文件夹asm_1,asm_2,asm_3。从这3个文件中的log_std发现这些log_std里的CL没有重复,而且加在一起正好是trinity_gt500.fasta_cl_clusters里的结果,表明程序先将trinity_gt500.fasta_cl_clusters中的cluster分给N个CPU(在N个文件夹中)来做的。然后,CAP3组装每个asm文件夹下log_std中的clusters,然而并不是每个cluster里的序列都能用上,组装没有用上的序列就写进了每个asm文件夹下的singletons文件里。 Q3:在第1个问题里rinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons为什么会有交集,交集又是什么? Bingo! 想必大家也猜到了,这个交集就是所有asm文件夹下singlets文件中序列的并集。为了验证这个猜想,先检查一下所有asm文件夹下singlets文件是否有交集,发现没有,不出所料。所有asm文件夹下singlets文件里的序列的并集就是trinity_gt500.fasta.singletons和trinity_gt500.fasta_cl_clusters序列里的交集。发现正好是的,有猜想一致! 综上所述,最终的结构图是 因此只需要将trinity_gt500.fasta.singletons里的序列提出来,再将各asm文件夹下的contigs合并到一起换个Unigene的ID号即可。 vim collect_tgicl_result.py 将以下代码复制到collect_tgicl_result.py中 #!/usr/bin/env python import sys, os from Bio import SeqIO '''collect_tgicl_result.py''' def main(tgicl, singleton, fasta, *asm_dirs): unigene = 1 with open('result.fasta', 'w') as result: for i in asm_dirs: for x in SeqIO.parse(os.path.join(i, 'contigs'), 'fasta'): print result, 'cluster_contig%s\n%s' %(unigene, x.seq) unigene += 1 fa_dic = {fa.id:fa.seq for fa in SeqIO.parse(fasta, 'fasta')} for i in open(singleton): print result, '%s\n%s' %(i.strip(), fa_dic ) if __name__ == '__main__': if len(sys.argv) == 1: print (' collect TGICL result\n' ' python %s singletons fasta asm*\n' ' singletons is located in workspace\n' ' fasta is fasta file which was used to be clustered and assembled\n' asm* are asm directories, `contigs' file must be in those dirs\n ) %(sys.argv ) sys.exit(1) main(*sys.argv) 在工作目录下运行如下: python collect_tgicl_result.py trinity_gt500 .fasta.singletons trinity_gt500 .fasta asm_* 运行完会生成一个tgicl_result.fasta文件,即TGICL聚类组装后最终的Unigene。 参考材料 [1] Geo Pertea1, Xiaoqiu Huang, Feng Liang, et al. TIGR Gene Indices clustering tools (TGICL):a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19(5):651-652 [2] http://bioinformation.cn/?p=563
我们都知道RNA-seq是通过NGS技术来检测基因表达量的测序方法。在衡量基因表达量方面,若是单纯以比对到参考基因的Reads个数(我们通常称之为Count值)来衡量基因的表达量,在统计上是一件相当不合理的事。今天就为大家介绍一下衡量基因表达量的RPKM和FPKM两种方法。 在随机抽样的情况下,序列较长的基因被抽到的概率本来就会比序列短的基因高,如此一来,序列长的基因永远会被认为表达量较高,而错估基因真正的表达量。在测序深度不同的情况下,测序深度更深的样品中,比对到每个基因的Read数量更多。 为排除因基因的长度、测序深度等因素造成的干扰,RPKM(Reads Per Kilobase Million)和FPKM(Fragments Per Kilobase Million)等方法就应运而生了。 RPKM (Reads Per Kilobase per Million)和 FPKM (Fragments Per Kilobase per Million) 首先需要解释FPKM和RPKM的原理是相似的,区别在于FPKM对应的是DNA片段,比如在一个Illumina的pair-end(双尾)RNA-seq中,一对(两个)reads对应是一个DNA片段。有了FPKM(RPKM)概念,我们就能比较:同一个样本中基因A和基因B的相对表达量;或者不同样本中,同一个基因的相对表达量。 具体的原因是:引入“每一千碱基(per kilobase)”的原因在于,不同的RNA可能有不同长度,长度越长,对应的reads就越多。当每个RNA都除以自身长度(以1000碱基为单位)时,就可以比较同一个样本中不同基因的相对表达量了。相似地,引入“每一百万reads”的原因是,不同的样本可能测序的深度不一样,深度越深,当然对应的reads就越多了。如果结果除以各自库的数量(以一百万reads为单位),那么我们就能很好地衡量两个不同样本中同一个基因的相对表达量。 RPKM RPKM是将Map到基因的Reads数除以Map到Genome的所有Read数(以Million为单位)与RNA的长度(以KB为单位)。 FPKM FPKM是将Map到基因的Fragments数除以Map到Genome的所有Read数(以Million为单位)与RNA的长度(以KB为单位)。 从公式上可以看出,方法是将Reads(Fragments)Count进行标准化,分别是对测序深度标准化(以Million为单位)和对基因长度标准化(以KB为单位),从而消除了因测序深度和基因长度不同对基因表达量的影响。
文章: A survey of best practices for RNA-seq data analysis 2016 ( http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8 ) 文章:A comparison of methods for differential expression analysis of RNA-seq data 2013 ( https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91 )
Identification of reference genes for qRT-PCR in human lung squamous-cell carcinoma by RNA-Seq Cheng Zhan, Yongxing Zhang, Jun Ma, Lin Wang, Wei Jiang, Yu Shi and Qun Wang Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai 200031, China Acta Biochim Biophys Sin 2014, 46: 330–337; doi: 10.1093/abbs/gmt153 Although the accuracy of quantitative real-time polymerase chain reaction (qRT-PCR) is highly dependent on the reliable reference genes, many commonly used reference genes are not stably expressed and as such are not suitable for quantification and normalization of qRT-PCR data. The aim of this study was to identify novel reliable reference genes in lung squamous-cell carcinoma. We used RNA sequencing (RNA-Seq) to survey the whole genome expression in 5 lung normal samples and 44 lung squamous-cell carcinoma samples. We evaluated the expression profiles of 15 commonly used reference genes and identified five additional candidate reference genes. To validate the RNA-Seq dataset, we used qRT-PCR to verify the expression levels of these 20 genes in a separate set of 100 pairs of normal lung tissue and lung squamous-cell carcinoma samples, and then analyzed these results using geNorm and NormFinder. With respect to 14 of the 15 common reference genes (B2M, GAPDH, GUSB, HMBS, HPRT1, IPO8, PGK1, POLR2A, PPIA, RPLP0, TBP, TFRC, UBC, and YWHAZ), the expression levels were either too low to be easily detected, or exhibited a high degree of variability either between lung normal and squamous-cell carcinoma samples, or even among samples of the same tissue type. In contrast, 1 of the 15 common reference genes (ACTB) and the 5 additional candidate reference genes (EEF1A1, FAU, RPS9, RPS11, and RPS14) were stably and constitutively expressed at high levels in all the samples tested. ACTB, EEF1A1, FAU, RPS9, RPS11, and RPS14 are ideal reference genes for qRT-PCR analysis of lung squamous-cell carcinoma, while 14 commonly used qRT-PCR reference genes are less appropriate in this context. Expression profiling of 15 common reference genes in RNA-Seq data 全文: http://abbs.oxfordjournals.org/content/46/4/330.full 相关论文: 1 Evaluation and validation of a robust single cell RNA -amplification protocol through transcriptional profiling of enriched lung cancer initiating cells 2 Reference Genes Selection and Normalization of Oxidative Stress Responsive Genes upon Different Temperature Stress Conditions in Hypericum perforatum L. 3 Connectivity Mapping for Candidate Therapeutics Identification Using Next Generation Sequencing RNA - Seq Data 4 Application of the whole-transcriptome shotgun sequencing approach to the study of Philadelphia-positive acute lymphoblastic leukemia 5 Identification of Pathogen Signatures in Prostate Cancer Using RNA - seq
FPKM, Fragments Kilobase of exon model per millon mapped reads, which can be used to indicate the expression (abundance) characteristics of genes. Now I will describe operation about obtaining interested gene FPKM value. 1.Software Download 1).fastq-dump: convert sra file to fastq file. website: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software 2).bowtie:an ultrafast and memory efficient tool for aligning sequencing reads to long reference sequences. website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 3).cufflinks:assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. website: http://cufflinks.cbcb.umd.edu/ 4).gffread: convert gff3 file to gtf file. website: http://cufflinks.cbcb.umd.edu/ (This program is included with cufflinks package) 2. Operation 1) Download genome.fa and genes.gff3 file from genome website; Download sra file from NCBI 2) Format conversion $ fastq-dump -I --split-files SRR123456789.sra # convert sra file to fastq file $ gffread -E genes.gff3 -o genes.gtf # convert gff3 file to gtf file 3) Index files $bowtie2-build genome.fa genome 4) Alignment $bowtie2 -x genome -1 SRR123456789_1.fastq -2 SRR123456789_2.fastq -S SRR123456789.sam $samtools view -bS SRR123456789.sam SRR123456789.bam $samtools sort SRR123456789.bam SRR123456789 5) FPKM values $cufflinks SRR123456789.bam -G genes.gtf -o result After these operations, we can extract FPKM values from genes.frkm_tracking file based on gene ID. Notes: If you find some bugs, please contact me.
List of RNA-Seq bioinformatics tools From Wikipedia, the free encyclopedia RNA-Seq ( RNA-Seq.ppt / RNA-Seq Guide ) is a revolutionary technique to perform transcriptome studies based on next-generation sequencing technologies. This technique is largely dependent on bioinformatics tools developed to support the different steps of the process. Here are listed some of the principal tools commonly employed and links to some related web resources. To follow an integrated guide to the analysis of RNA-seq data, please see - Next Generation Sequencing (NGS)/RNA , Hands-On Tutorial or RNA-Seq Workflow . Also, important links are SEQanswers , RNA-SeqList , RNA-SeqBlog , Biostar and bioscholar . Contents 1 Quality control and pre-processing data 1.1 Quality control and filtering data 1.2 Pre-processing data 2 Alignment Tools 2.1 Short (Unspliced) aligners 2.2 Spliced aligners 2.2.1 Aligners based on known splice junctions (annotation-guided aligners) 2.2.2 De novo Splice Aligners 2.2.2.1 De novo Splice Aligners that also use annotation optionally 2.2.2.2 Other Spliced Aligners 3 Quantitative analysis and Differential Expression 3.1 Multi-tool solutions 4 Workbench (analysis pipeline / integrated solutions) 4.1 Commercial Solutions 4.2 Open Source Solutions 5 Fusion genes/chimeras/translocation finders/structural variations 6 Copy Number Variations identification 7 RNA-Seq simulators 8 Transcriptome assemblers 8.1 Genome-Guided assemblers 8.2 Genome-Independent assemblers 9 Visualization tools 10 Functional, Network Pathway Analysis Tools 11 Further annotation tools for RNA-Seq data 12 RNA-Seq Databases 13 Webinars and Presentations 14 References Quality control and pre-processing data Quality control and filtering data Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq. Often, is necessary to filter data, removing low quality sequences or bases (trimming), linkers, overrepresented sequences or noise to assure a coherent final result. clean_reads clean_reads . condetri condetri . cutadapt cutadapt removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case. FastQC FastQC is a quality control tool for high-throughput sequence data ( Babraham Institute ) and is developed in Java . Import of data is possible from FastQ files, BAM or SAM format. This tool provides an overview to inform about problematic areas, summary graphs and tables to rapid assessment of data. Results are presented in HTML permanent reports. FastQC can be run as a stand alone application or it can be integrated into a larger pipeline solution. See also seqanswers/FastQC . FASTX FASTX Toolkit is a set of command line tools to manipulate reads in files FASTA or FASTQ format. These commands make possible preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, information about statistics of quality, removing sequencing adapters, filtering and cutting sequences based on quality or conversion DNA / RNA . Flexbar Flexbar performs removal of adapter sequences, trimming and filtering features. FreClu FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology. HTSeq HTSeq . htSeqTools htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes possible visualize sample correlations, to remove over-amplification artifacts, to assess enrichment efficiency, to correct strand bias and visualize hits. PRINSEQ PRINSEQ generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios. qrqc qrqc is a Bioconductor package to quick read quality control. RNA-SeQC RNA-SeQC is a tool with application in experiment design, process optimization and quality control before computational analysis. Essentially, provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (like mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and is not required installation, however can be run using the GenePattern web interface. The input could be one or more BAM files. HTML reports are generated as output. RSeQC RSeQC analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA, BED files or Chromosome size file (two-column, plain text file). Visualization can be performed by genome browsers like UCSC, IGB and IGV. However, R scripts can also be used to visualization. Sabre sabre . SAMStat SAMStat identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping. Scythe scythe . SEECER seecer SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly. Sickle Sickle . ShortRead ShortRead is a package provided in the R (programming language) / BioConductor environments and allows input, manipulation, quality assessment and output of next-generation sequencing data. This tool makes possible manipulation of data, such as filter solutions to remove reads based on predefined criteria. ShortRead could be complemented with several Bioconductor packages to further analysis and visualization solutions ( BioStrings , BSgenome , IRanges , and so on). See also seqanswers/ShortRead . SysCall SysCall is a classifier tool to identification and correction of systematic error in high-throughput sequence data. Trimmomatic Trimmomatic performs trimming for Illumina platforms and works with FASTQ reads (single or pair-ended). Some of the tasks executed are: cut adapters, cut bases in optional positions based on quality thresholds, cut reads to a specific length, converts quality scores to Phred-33/64. Pre-processing data Further tasks performed before alignment. DeconRNASeq DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. FastQ Screen FastQ Screen screens FASTQ format sequences to a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc). FLASH FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads. IDCheck IDCheck . Alignment Tools After control assessment, the first step of RNA-Seq analysis involves alignment (RNA-Seq alignment) of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software and HTS Mappers . Short (Unspliced) aligners Short aligners are able to align continuous reads (not containing gaps result of splicing) to a genome of reference. Basically, there are two types: 1) based on the Burrows-Wheeler transform method such as Bowtie and BWA, and 2) based on Seed-extend methods, Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster, however some tools of the second group, despite the time spent tend to be more sensitive, generating more reads correctly aligned. See a comparative study of short aligners - comparative study . BFAST BFAST aligns short reads to reference sequences and presents particular sensitivity towards errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanwers/BFAST . Bowtie Bowtie is a fast short aligner using an algorithm based on the Burrows-Wheeler transform and the FM-index . Bowtie tolerates a small number of mismatches. See also seqanswers/Bowtie . Burrows-Wheeler Aligner (BWA) BWA implements two algorithms, mainly based on Burrows–Wheeler transform . The first algorithm is used with reads with low error rate (3%). The second algorithm was designed to handle more errors and implements a combined strategy: Burrows–Wheeler transform and Smith-Waterman method. BWA allows mismatches and small gaps (insertions and deletions). The output is presented in SAM format. See also seqanswers/BWA . Short Oligonucleotide Analysis Package (SOAP) SOAP . GNUMAP GNUMAP performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool is able to handle alignment in repetitive regions of a genome without losing information. The output of the program was developed to make possible easy visualization using available software. Maq Maq first aligns reads to reference sequences and after performs a consensus stage. On the first stage performs only ungapped alignment and tolerates up to 3 mismatches. See also seqanswers/Maq . Mosaik Mosaik . Mosaik is able to align reads containing short gaps using Smith-Waterman algorithm , ideal to overcome SNPs, insertions and deletions. See also seqanswers/Mosaik . NovoAlign (commercial) NovoAlign is a short aligner to the Illumina platform based on Needleman-Wunsch algorithm. Novoalign tolerates up to 8 mismatches per read, and up to 7bp of indels. It is able to deal with bisulphite data. Output in SAM format. See also seqanswers/NovoAlign . RazerS RazerS . See also seqanswers/RazerS . SEAL SEAL uses a MapReduce model to produce distributed computing on clusters of computers. Seal uses BWA to perform alignment and Picard MarkDuplicates to detection and duplicate read removal. See also seqanswers/SEAL . SeqMap SeqMap . See also seqanswers/SeqMap . SHRiMP SHRiMP employs two techniques to align short reads. Firstly, the q-gram filtering technique based on multiple seeds identifies candidate regions. Secondly, these regions are investigated in detail using Smith-Waterman algorithm. See also seqanswers/SHRiMP . SMALT Smalt . Stampy Stampy combines the sensitivity of hash tables and the speed of BWA. Stampy is prepared to alignment of reads containing sequence variation like insertions and deletions. It is able to deal with reads up to 4500 bases and presents the output in SAM format. See also seqanswers/Stampy . ZOOM (commercial) ZOOM is a short aligner of the Illumina/Solexa 1G platform. ZOOM uses extended spaced seeds methodology building hash tables for the reads, and tolerates mismatches and insertions and deletions. See also seqanswers/ZOOM . Spliced aligners Many reads span exon-exon junctions and can not be aligned directly by Short aligners, thus specific aligners were necessary - Spliced aligners. Some Spliced aligners employ Short aligners to align firstly unspliced/continuous reads (exon-first approach), and after follow a different strategy to align the rest containing spliced regions - normally the reads are split into smaller segments and mapped independently. See also Methods to study splicing from high-throughput RNA Sequencing data and Methods to Study RNA-Seq (workflow) . Aligners based on known splice junctions (annotation-guided aligners) In this case the detection of splice junctions is based on data available in databases about known junctions. This type of tools cannot identify new splice junctions. Some of this data comes from other expression methods like expressed sequence tags (EST). Erange Erange is a tool to alignment and data quantification to mammalian transcriptomes. See also seqanswers/Erange . IsoformEx IsoformEx . MapAL MapAL . OSA OSA . RNA-MATE RNA-MATE is a computational pipeline for alignment of data from Applied Biosystems SOLID system. Provides the possibility of quality control and trimming of reads. The genome alignments are performed using mapreads and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting. See also seqanswers/RNA-MATE . RUM RUM performs alignment based on a pipeline, being able to manipulate reads with splice junctions, using Bowtie and Blat. The flowchart starts doing alignment against a genome and a transcriptome database executed by Bowtie. The next step is to perform alignment of unmapped sequences to the genome of reference using BLAT. In the final step all alignments are merged to get the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format. RNASEQR RNASEQR . See also seqanswers/RNASEQR . SAMMate SAMMate . See also seqanswers/SAMMate . SpliceSeq SpliceSeq . X-Mate X-Mate . De novo Splice Aligners De novo Splice aligners allow the detection of new Splice junctions without need to previous annotated information (some of these tools present annotation as a suplementar option). See also De novo Splice Aligners . ABMapper ABMapper . See also seqanswers/ABMapper . ContextMap ContextMap was developed to overcome some limitations of other mapping approaches, such as resolution of ambiguities. The central idea of this tool is to consider reads in gene expression context, improving this way alignment accuracy. ContextMap can be used as a stand-alone program and supported by mappers producing a SAM file in the output (e.g.: TopHat or MapSplice). In stand-alone mode aligns reads to a genome, to a transcriptome database or both. CRAC CRAC propose a novel way of analyzing reads that integrates genomic locations and local coverage, and detect candidate mutations, indels, splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with e.g. 200 nt reads and should fit future needs of read analyses. GSNAP GSNAP . See also seqanswers/GSNAP . HMMSplicer HMMSplicer can identify canonical and non-canonical splice junctions in short-reads. Firstly, unspliced reads are removed with Bowtie. After that, the remaining reads are one at a time divided in half, then each part is seeded against a genome and the exon borders are determined based on the Hidden Markov Model . A quality score is assigned to each junction, useful to detect false positive rates. See also seqanswers/HMMSplicer . MapSplice MapSplice . See also seqanswers/MapSplice . OLego OLego . See also seqanswers/OLego . PALMapper PALMapper . See also seqanswers/PALMapper . Pass Pass aligns gapped, ungapped reads and also bisulfite sequencing data. It includes the possibility to filter data before alignment (remotion of adapters). Pass uses Needleman-Wunsch and Smith-Waterman algorithms, and performs alignment in 3 stages: scanning positions of seed sequences in the genome, testing the contiguous regions and finally refining the alignment. See also seqanswers/Pass . PASSion PASSion . PASTA PASTA . QPALMA QPALMA predicts splice junctions supported on machine learning algorithms. In this case the training set is a set of spliced reads with quality information and already known alignments. See also seqanswers/QPALMA . SeqSaw SeqSaw . SoapSplice SoapSplice . SpliceMap SpliceMap . See also seqanswers/SpliceMap . SplitSeek SplitSeek . See also seqanswers/SplitSeek . SuperSplat SuperSplat was developed to find all type of splice junctions. The algorithm splits each read in all possible two-chunk combinations in an iterative way, and alignment is tried to each chunck. Output in “Supersplat” format. See also seqanswers/SuperSplat . Subread Subread is a superfast, accurate and scalable read aligner. It uses the seed-and-vote mapping paradigm to determine the mapping location of the read by using its largest mappable region. It automatically decides whether the read should be globally mapped or locally mapped. For RNA-seq data, Subread should be used for the purpose of expression analysis. Subread is very powerful in mapping gDNA-seq reads as well. See also seqanswers/Subread . Subjunc Subjunc is a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions. It uses the donor/receptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for the purpose of junction detection and genomic variation detection in RNA-seq data. See also seqanswers/Subjunc . TrueSight TrueSight . De novo Splice Aligners that also use annotation optionally GEM . MapNext MapNext . See also seqanswers/MapNext . STAR STAR is an ultrafast tool that employs “sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure”, detects canonical, non-canonical splices junctions and chimeric-fusion sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. See also seqanswers/STAR . TopHat TopHat is prepared to find de novo junctions. TopHat aligns reads in two steps. Firstly, unspliced reads are aligned with Bowtie. After, the aligned reads are assembled with Maq resulting islands of sequences. Secondly, the splice junctions are determined based on the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences. See also seqanswers/TopHat . Other Spliced Aligners G.Mo.R-Se G.Mo.R-Se is a method that uses RNA-Seq reads to build de novo gene models. Quantitative analysis and Differential Expression These tools calculate the abundance of each gene expressed in a RNA-Seq sample (see also Quantification models ). Some softwares are also designed to study the variability of genetic expression between samples (differential expression). Quantitative and differential studies are largely determined by the quality of reads alignment and accuracy of isoforms reconstruction. See a comparative study of differential expression methods and Which method should you use for normalization of rna-seq data? . Alexa-Seq Alexa-Seq is a pipeline that makes possible to perform gene expression analysis, transcript specific expression analysis, exon junction expression and quantitative alternative analysis. Allows wide alternative expression visualization, statistics and graphs. See also seqanswers/Alexa-Seq . ASC ASC . See also seqanswers/ASC . BaySeq BaySeq is a Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods . There is an option of using the “snow” package for parallelisation of computer data processing, recommended when dealing with large data sets. See also seqanswers/BaySeq . BBSeq BBSeq . See also seqanswers/BBSeq . BitSeq BitSeq . CEDER CEDER . CPTRA CPTRA . casper casper is a Bioconductor package to quantify expression at the isoform level. It combines using informative data summaries, flexible estimation of experimental biases and statistical precision considerations which (reportedly) provide substantial reductions in estimation error. Cufflinks Cufflinks is appropriate to measure global de novo transcript isoform expression. It performs assembly of transcripts, estimation of abundances and determines differential expression and regulation in RNA-Seq samples. See also seqanswers/Cufflinks . DESeq DESeq is a Bioconductor package to perform differential gene expression analysis based on negative binomial distribution. See also seqanswers/DESeq . DEGSeq DEGSeq . See also seqanswers/DEGSeq . DEXSeq DEXSeq is Bioconductor package that finds differential differential exon usage based on RNA-Seq exon counts between samples. DEXSeq employs negative binomial distribuition, provides options to visualization and exploration of the results. DiffSplice DiffSplice is a method for differential expression detection and visualization, not dependent on gene annotations. This method is supported on identification of alternative spicing modules (ASMs) that diverge in the different isoforms. A non-parametric test is applied to each ASM to identify significant differential transcription with a measured false discovery rate. EBSeq EBSeq . EdgeR EdgeR is a R package for analysis of differential expression of data from DNA sequencing methods, like RNA-Seq, SAGE or ChIP-Seq data. edgeR employs statistical methods supported on negative binomial distribution as a model for count variability. See also seqanswers/EdgeR . eXpress eXpress performance includes transcript-level RNA-Seq quantification, allele-specific and haplotype analysis and can estimate transcript abundances of the multiple isoforms present in a gene. Although could be coupled directly with aligners (like Bowtie), eXpress can also be used with de novo assemblers and thus is not needed a reference genome to perform alignment. It runs on Linux, Mac and Windows. ERANGE ERANGE performs alignment, normalization and quantification of expressed genes. See also seqanswers/ERANGE . featureCounts featureCounts an efficient read summarization/quantification program. It is implemented in both SourceForge Subread package and Bioconductor Rsubread package . FDM FDM GPSeq GPSeq MATS MATS . MISO MISO quantifies the expression level of splice variants from RNA-Seq data and is able to recognize differentially regulated exons/isoforms across different samples. MISO uses a probabilistic method (Bayesian inference) to calculate the probability of the reads origin. See also seqanswers/MISO . MMSEQ MMSEQ is a pipeline for estimating isoform expression and allelic imbalance in diploid organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat, ArrayExpressHTS and SAMtools. Also, edgeR or DESeq to perform differential expression. See also seqanswers/MMSEQ . Myrna Myrna is a pipeline tool that runs in a cloud environment ( Elastic MapReduce ) or in a unique computer for estimating differential gene expression in RNA-Seq datasets. Bowtie is employed for short read alignment and R algorithms for interval calculations, normalization, and statistical processing. See also seqanswers/Myrna . NEUMA NEUMA is a tool to estimate RNA abundances using length normalization, based on uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data available in databases like RefSeq . NOISeq NOISeq . See also seqanswers/NOISeq . NSMAP NSMAP allows inference of isoforms as well estimation of expression levels, without annotated information. The exons are aligned and splice junctions are identified using TopHat. All the possible isoforms are computed by combination of the detected exons. RNAeXpress RNAeXpress Can be run with Java GUI or command line on Mac, Windows and Linux. Can be configured to perform read counting, feature detection or GTF comparison on mapped rnaseq data. rSeq rSeq RSEM RSEM . See also seqanswers/RSEM . rQuant rQuant is a web service ( Galaxy (computational biology) installation) that determines abundances of transcripts per gene locus, based on quadratic programming . rQuant is able to evaluate biases introduced by experimental conditions. A combination of tools is employed: PALMapper (reads alignment), mTiM and mGene (inference of new transcripts). Scotty Scotty Performs power analysis to estimate the number of replicates and depth of sequencing required to call differential expression. SpliceTrap SpliceTrap . SplicingCompass SplicingCompass . Multi-tool solutions DEB DEB is a web-interface/pipeline that permits to compare results of significantly expressed genes from different tools. Currently are available three algorithms: edgeR, DESeq and bayseq. Workbench (analysis pipeline / integrated solutions) See also an interesting blog 16 rna-seq tools you have to consider for your analysis pipeline . Commercial Solutions Avadis NGS Avadis NGS . CLC Genomics Workbench CLC Genomics Workbench DNASTAR DNASTAR GeneSpring GX GeneSpring GX geospiza geospiza Golden Helix Golden Helix NextGENe NextGENe Partek Partek Open Source Solutions ArrayExpressHTS ArrayExpressHTS (and ebi_ArrayExpressHTS ) is a BioConductor package that allows preprocessing, quality assessment and estimation of expression of RNA-Seq datasets. It can be run remotely at the European Bioinformatics Institute cloud or locally. The package makes use of several tools: ShortRead (quality control), Bowtie, TopHat or BWA (alignment to a reference genome), SAMtools format, Cufflinks or MMSEQ (expression estimation). See also seqanswers/ArrayExpressHTS . BiNGS!SL-seq . Chipster Chipster . easyRNASeq easyRNASeq . ExpressionPlot ExpressionPlot . FX FX . Galaxy : Galaxy is a general purpose workbench platform for computational biology. There are several publicly accessible Galaxy servers that support RNA-Seq tools and workflows, including NBIC's Andromeda , the CBIIT-Giga server, the Galaxy Project's public server , the GeneNetwork Galaxy server , the University of Oslo 's Genomic Hyperbrowser , URGI 's server (which supports S-MART), and many others. GENE-Counter GENE-Counter is a Perl pipeline for RNA-Seq differential gene expression analyses. Gene-counter performs alignments with CASHX, Bowtie, BWA or other SAM output aligner. Differential gene expression is run with three optional packages (NBPSeq, edgeR and DESeq) using negative binomial distribution methods. Results are stored in a MySQL database to make possible additional analyses. GenePattern GenePattern offers integrated solutions to RNA-Seq analysis ( Broad Institute ). GeneProf GeneProf : Freely accessible, easy to use analysis pipelines for RNA-seq and ChIP-seq experiments. MultiExperiment Viewer (MeV) MeV is suitable to perform analysis, data mining and visualization of large-scale genomic data. The MeV modules include a variety of algorithms to execute tasks like Clustering and Classification, Student's t-test , Gene Set Enrichment Analysis or Significance Analysis. MeV runs on Java . See also seqanswers/MeV . NGSUtils NGSUtils . RobiNA RobiNA provides a user graphical interface to deal with R/BioConductor packages. RobiNA provides a package that automatically installs all required external tools (R/Bioconductor frameworks and Bowtie ). This tool offers a diversity of quality control methods and the possibility to produce many tables and plots supplying detailed results for differential expression. Furthermore, the results can be visualized and manipulated with MapMan and PageMan . RobiNA runs on Java version 6. S-MART S-MART handles mapped RNA-Seq data, and performs essentially data manipulation (selection/exclusion of reads, clustering and differential expression analysis) and visualization (read information, distribution, comparison with epigenomic ChIP-Seq data). It can be run on any laptop by a person without computer background. A friendly graphycal user interface makes easy the operation of the tools. See also seqanswers/S-MART . Taverna Taverna . wapRNA wapRNA . Fusion genes/chimeras/translocation finders/structural variations Genome arrangements result of diseases like cancer can produce aberrant genetic modifications like fusions or translocations. Identification of these modifications play important role in carcinogenesis studies. BreakDancer BreakDancer . See also seqanswers/BreakDancer . ChimeraScan ChimeraScan . EBARDenovo EBARDenovo . FusionAnalyser FusionAnalyser . FusionCatcher FusionCatcher . FusionHunter FusionHunter identifies fusion transcripts without depending on already known annotations. It uses Bowtie as a first aligner and paired-end reads. See also seqanswers/FusionHunter . FusionMap FusionMap . FusionSeq FusionSeq . See also seqanswers/FusionSeq . SOAPFuse SOAPFuse . SOAPfusion Soapf usion . TopHat-Fusion TopHat-Fusion is based on TopHat version and was developed to handle reads resulting from fusion genes. It does not require previous data about known genes and uses Bowtie to align continuous reads. See also seqanswers/TopHat-Fusion . ViralFusionSeq ViralFusionSeq is high-throughput sequencing (HTS) tool for discovering viral integration events and reconstruct fusion transcripts at single-base resolution. See also hkbic/VFS and SEQWiki/VFS . DeFuse DeFuse . Copy Number Variations identification CNVseq CNVseq detects copy number variations supported on a statistical model derived from array-comparative genomic hybridization . Sequences alignment are performed by BLAT, calculations are executed by R modules and is fully automated using Perl. See also seqanswers/CNVseq . RNA-Seq simulators These Simulators generate in silico reads and are a useful tool to compare and test the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make possible to analyse and model RNA-Seq protocols. BEERS Simulator BEERS is formatted to mouse or human data, and paired-end reads sequenced on Illumina platform. Beers generates reads starting from a pool of gene models coming from different published annotation origins. Some genes are chosen randomly and afterwards are introduced deliberately errors (like indels, base changes and low quality tails), followed by construction of novel splice junctions. dwgsim dwgsim . Flux simulator Flux Simulator implements a computer pipeline simulation to mimic a RNA-Seq experiment. All component steps that influence RNA-Seq are taken into account (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing) in the simulation. These steps present experimental attributes that can be measured, and the approximate experimental biases are captured. Flux Simulator allows joining each of these steps as modules to analyse different type of protocols. See also seqanswers/Flux . RSEM Read Simulator rsem-simulate-reads . RNASeqReadSimulator RNASeqReadSimulator contains a set of simple Python scripts, command line driven. It generates random expression levels of transcripts (single or paired-end), equally simulates reads with a specific positional bias pattern and generates random errors from sequencing platforms. RNA Seq Simulator RNA Seq Simulator . Transcriptome assemblers The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assemble transcriptomes. Genome-guided methods use a reference genome (if possible a finished and high quality genome) as a template to align and assembling reads into transcripts. Genome-independent methods does not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly in transcripts. Genome-Guided assemblers Cufflinks Cufflinks . IsoInfer IsoInfer . IsoLasso IsoLasso . RNAeXpress RNAeXpress . Scripture Scripture . See also seqanswers/Scripture . Genome-Independent assemblers KISSPLICE KISSPLICE . Oases Oases . See also seqanswers/Oases . Rnnotator . SOAPdenovo SOAPdenovo . See also seqanswers/SOAPdenovo . Scaffolding Translation Mapping (STM) . Trans-ABySS Trans-AByss . See also seqanswers/Trans-ABySS . Trinity Trinity . See also seqanswers/Trinity . Velvet Velvet (algorithm) . Velvet(EMBL-EBI) . See also seqanswers/Velvet . Visualization tools Artemis Artemis . Apollo Apollo . EagleView EagleView . GBrowse GBrowse . Integrated Genome Browser IGB . Integrative Genomics Viewer (IGV) IGV . GenomeView genomeview . MapView MapView . Tablet Tablet Tbrowse- HTML5 Transcriptome Browser Tbrowse . Savant Savant Samscope Samscope . SeqMonk SeqMonk . See also seqanswers/SeqMonk . Vespa Vespa . Functional, Network Pathway Analysis Tools Ingenuity Systems (commercial) iReport IPA : Ingenuity’s IPA and iReport applications enable you to upload, analyze, and visualize RNA-Seq datasets, eliminating the obstacles between data and biological insight. Both IPA and iReport support identification, analysis and interpretation of differentially expressed isoforms between condition and control samples, and support interpretation and assessment of expression changes in the context of biological processes, disease and cellular phenotypes, and molecular interactions. Ingenuity iReport supports the upload of native Cuffdiff file format as well as gene expression lists. IPA supports the upload of gene expression lists. Gene Set Association Analysis for RNA-Seq ( GSAA-Seq ) : GSAA-Seq is a computational method that assesses the differential expression of a pathway/gene set between two biological states based on RNA-Seq data. Further annotation tools for RNA-Seq data seq2HLA seq2HLA is an annotation tool for obtaining an individual's HLA class I and II type and expression using standard NGS RNA-Seq data in fastq format. It comprises mapping RNA-Seq reads against a reference database of HLA alleles using bowtie , determining and reporting HLA type, confidence score and locus-specific expression level. This tool is developed in Python and R . It is available as console tool or Galaxy module. See also seqanswers/seq2HLA . HLAminer HLAminer is a computational method for identifying HLA alleles directly from whole genome, exome and transcriptome shotgun sequence datasets. HLA allele predictions are derived by targeted assembly of shotgun sequence data and comparison to a database of reference allele sequences. This tool is developed in perl and it is available as console tool. pasa pasa . RNA-Seq Databases queryable-rna-seq-database queryable-rna-seq-database . RNA-Seq Atlas RNA-Seq Atlas . SRA SRA . Webinars and Presentations RNASeq-Blog Presentations RNA-Seq Workshop Documentation (UC Davis University) VIDEO: Strategies for Identifying Biologically Compelling Genes from Breast Cancer Subtype RNA-Seq Profiles with Accompanying Analysis Princeton Workshop Youtube/RNA-Seq NGS Leaders COFACTOR genomics References ^ Wang Z, Gerstein M, Snyder M. (January 2009). RNA-Seq: a revolutionary tool for transcriptomics . Nature Reviews Genetics 10 (1): 57–63. doi : 10.1038/nrg2484 . PMC 2949280 . PMID 19015660 . ^ a b Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote . Nucleic Acids Research 41 . doi : 10.1093/nar/gkt214 . PMID 23558742 . ^ http://bioinformatics.oxfordjournals.org/content/29/1/15.full ^ Cole Trapnell, Lior Pachter and Steven Salzberg (2009). TopHat: discovering splice junctions with RNA-Seq . Bioinformatics 25 (9): 1105–1111. doi : 10.1093/bioinformatics/btp120 . PMC 2672628 . PMID 19289445 . ^ Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold and Lior Pachter (2010). Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms . Nature Biotechnology 28 (5): 511–515. doi : 10.1038/nbt.1621 . PMC 3146043 . PMID 20436464 . ^ Zerbino DR, Birney E (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs . Genome Research 18 (5): 821–829. doi : 10.1101/gr.074492.107 . PMC 2336801 . PMID 18349386 . zz: http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools