本文转载自嘉因微信公众号,已获得授权。查看最新文章,敬请关注嘉因,微信ID:rainbow-genome 作者:小哈 来源: 嘉因 大家都会做方便面,有人做辛拉面,有人做三鲜伊面,工艺有何不同? 大家都会做RNA-seq,有人能筛出有意义的基因,有人能找出有价值的线索,有人。。。差别在哪? 前四期介绍了数据均一化处理、差异基因筛选、画heatmap和富集分析的合理方法: 第一期:数据预处理: 同一套RNA-seq,为什么公司做的跟师兄跑的结果不一样? | TPM、read counts、RPKM/FPKM你选对了吗? 第二期:差异基因筛选: 同一套RNA-seq,公司筛出的差异基因跟师兄筛出的为什么不一样?| Pvalue, FDR, cutoff 第三期:heatmap: heatmap画不好会得出错误结论 | 数据预处理、聚类分析,HCL、 K means里的讲究 第四期:富集分析: 富集分析,俩人做的结果差5岁 | 你用的注释文件有多老? 小哈让我们算read counts, 可是, 为什么我算的read counts跟公司算的还是不一样 ?本期回过头来看mapping时选用的Gene model对结果的影响。 拿到测序数据,首先要把read回帖到基因组上,这时需要基因组序列fasta文件,还要告诉它基因组上哪个位置有基因,即gene model,保存在 gtf文件 里。 如果分析人或小鼠的数据,就用 GENCODE 。那么,著名的Ensembl、UCSC、Refseq,跟GENCODE是啥关系?其他物种用哪个呢? Ensembl说目前这个版本的 GENCODE = Ensembl ,www.ensembl.org/Help/Faq?id=303 只有GENCODE自己知道,它跟ensembl还是有些区别的,GTF文件稍有不同,www.gencodegenes.org/faq.html 点击查看清晰大图 Ensembl、UCSC、Refseq,选择不同, 对结果有多大影响 ? 有人专门做了对比。 先总体评价了三种gene model对mapping的影响; 然后举例细看对某些基因的具体影响; 先说结论: Gene model 会影响 基因表达量乃至差异表达基因的筛选,尤其是不同gene model对某些基因的长度、junction位点注释有出入; Ensembl的注释相对更加准确,基因更多; 推荐 人鼠用 GENCODE ,谁让它出自最权威的ENCODE呢,其他物种用 Ensembl 。 下面逐个查看文章里的结果: The read mapping summary in the “transcriptome only” and “transcriptome + genome” mapping modes: more reads are mapped in Ensembl than in RefGene and UCSC in the “transcriptome only” mode more reads become multiple-mapped in Ensembl than in RefGene and UCSC The RefGene and UCSC consistently have the highest percentage of uniquely mapped reads; while the percentage of non-uniquely mapped reads is much higher in Ensembl. Without a gene model (indicated in pink) in the mapping step, a constant 6% of reads become unmapped. Divided uniquely mapped reads into two classes, i.e., non-junction reads and junction reads , and investigated the impact of a gene model on their mapping. The impact of a gene model on mapping of non-junction reads is different from junction reads. For the RNA-Seq dataset with a read length of 75 bp, on average, 95% of non-junction reads were mapped to exactly the same genomic location regardless of which gene models was used. By contrast , this percentage dropped to 53% for junction reads. In addition, about 30% of junction reads failed to align without the assistance of a gene model, while 10– 15% mapped alternatively. The overlap and intersection among RefGene, UCSC, and Ensembl annotations In general, different annotations have very high overlaps: there are 21,598 common genes shared by all three gene models. RefGene has the fewest unique genes while more than 50% of genes in Ensembl are unique . The correlation of gene quantification results between RefGene and Ensembl Although the majority of genes have highly consistent or nearly identical expression levels, there are many genes whose quantification results are dramatically affected by the choice of a gene model 具体看每个基因的read counts,用Ensembl和RefGene算出来的read counts差好远,为什么呢?下面举例看2个基因的情况 The different gene definitions for PIK3CA give rise to differences in gene quantification PIK3CA in the Ensembl annotation is much longer than its definition in RefGene , explaining why there are 1094 reads mapped to PIK3CA in Ensembl , while only 492 reads are mapped in RefGene . The PIK3CA gene definition in Ensembl seems more accurate than the one in RefGene, based upon the mapping profile of sequence reads. The different gene definitions for LUZP6 . In the Ensembl annotation, LUZP6 is only 177 bp long, and it is completely within another gene, MTPN. As a result, all sequence reads originating from LUZP6 are assigned to MTPN instead. In RefGene , LUZP6 and MTPN are derived from the same genomic region, and both encode exactly the same mRNA, though the protein coding sequences are different. Therefore, all reads mapped to this region are equally distributed between these two genes. The correlation of the calculated Log2Ratio (heart/liver) between RefGene and Ensembl. Although the majority of genes have highly consistent expression changes, there are many genes that are remarkably affected by the choice of different gene models .
Trinity无比强大,但是在组装结果还是太零散,此外内存占用也是硬伤。如果你有一个土豪机器,能一次运行几百G的程序,那建议将所有样本的PE reads清理后R1,R2相应合到一起,再用Trinity 进行PE组装。没有强大机器的人们一般会将多个样本分开组装,再通过一些手段对contig再组装,最后生成所有样本共同的Unigene,得到所有样本统一的reference,来比对计算差异表达。 这里推荐一款老牌的聚类组装软件——TGICL ( TGI Clustering tools ),完成这个contig再组装的过程。它先使用mageblast对输入的fasta文件进行比对后聚类成cluster,再使用CAP3对每个cluster进行组装。此外因为聚类后各cluster间相互独立无交集,程序提供多线程的选项(来弥补CAP3这一步计算速度慢的缺陷)。 1. 下载安装 到https://sourceforge.net/projects/tgicl/files/?source=navbar下载最新版的TGICL (以最新版v2.1版为例)。建议下载TGICL-2.1.tar.gz原程序,貌似rpm/deb二进制的安装会有报错,具体没有深究,感兴趣的朋友可以使用rpm/deb格式的安装包进行安装,欢迎将过程分享到此博文下,谢谢。 tar -zxvf TGICL-2.1.tar.gz cd TGICL-2.1 perl Build.PL ./Build ./Build test ./Build install #如果哪一步有permission denied的提示,则需要在命令前加sudo 如果在终端中输入 perldoc -F /usr/local/bin/tgicl 出现TGICL的help页面(如下图)则表示安装完成! 2. TGICL的使用 以某处为工作目录,将需去冗余和组装的fasta序列(例如Trinity组装结果中大于500bp的序列文件trinity_gt500.fasta)放到工作目录下,运行 tgicl -F trinity_gt500.fasta -c 3 这时可能会出现如下提示,可以不用理会: Use of :locked is deprecated at /usr/local/share/perl/5.18.2/TGI/DBDrv.pm line 36. 运行完结果如下 err_tgicl_trinity_gt500.fasta.log和tgicl_trinity_gt500.fasta.log分别是标准错误输出和标准输出的记录文件,如果运行中无报错,两者应该是一样的。文件中记录了建立索引、聚类和组装三个过程的一些信息。 重点来了: Q1:工作目录下trinity_gt500.fasta, trinity_gt500.fasta_cl_clusters, trinity_gt500.fasta.singletons和masked.lst这四个文件之间是什么关系? 为了清晰起见,这里既作了后三个文件中序列的韦恩图,又作了四个文件中序列的韦恩图。 结果显示 A. trinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons文件中的序列居然有交集! B. trinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons的并集就是trinity_gt500.fasta中全体序列 C. masked.lst里的序列既有 trinity_gt500.fasta_cl_clusters里的,又有trinity_gt500.fasta.singletons里的序列 Q2:asm文件夹下的文件与cluster是什么关系? 因为-c指定用3个CPU,这里会生成3个文件夹asm_1,asm_2,asm_3。从这3个文件中的log_std发现这些log_std里的CL没有重复,而且加在一起正好是trinity_gt500.fasta_cl_clusters里的结果,表明程序先将trinity_gt500.fasta_cl_clusters中的cluster分给N个CPU(在N个文件夹中)来做的。然后,CAP3组装每个asm文件夹下log_std中的clusters,然而并不是每个cluster里的序列都能用上,组装没有用上的序列就写进了每个asm文件夹下的singletons文件里。 Q3:在第1个问题里rinity_gt500.fasta_cl_clusters和trinity_gt500.fasta.singletons为什么会有交集,交集又是什么? Bingo! 想必大家也猜到了,这个交集就是所有asm文件夹下singlets文件中序列的并集。为了验证这个猜想,先检查一下所有asm文件夹下singlets文件是否有交集,发现没有,不出所料。所有asm文件夹下singlets文件里的序列的并集就是trinity_gt500.fasta.singletons和trinity_gt500.fasta_cl_clusters序列里的交集。发现正好是的,有猜想一致! 综上所述,最终的结构图是 因此只需要将trinity_gt500.fasta.singletons里的序列提出来,再将各asm文件夹下的contigs合并到一起换个Unigene的ID号即可。 vim collect_tgicl_result.py 将以下代码复制到collect_tgicl_result.py中 #!/usr/bin/env python import sys, os from Bio import SeqIO '''collect_tgicl_result.py''' def main(tgicl, singleton, fasta, *asm_dirs): unigene = 1 with open('result.fasta', 'w') as result: for i in asm_dirs: for x in SeqIO.parse(os.path.join(i, 'contigs'), 'fasta'): print result, 'cluster_contig%s\n%s' %(unigene, x.seq) unigene += 1 fa_dic = {fa.id:fa.seq for fa in SeqIO.parse(fasta, 'fasta')} for i in open(singleton): print result, '%s\n%s' %(i.strip(), fa_dic ) if __name__ == '__main__': if len(sys.argv) == 1: print (' collect TGICL result\n' ' python %s singletons fasta asm*\n' ' singletons is located in workspace\n' ' fasta is fasta file which was used to be clustered and assembled\n' asm* are asm directories, `contigs' file must be in those dirs\n ) %(sys.argv ) sys.exit(1) main(*sys.argv) 在工作目录下运行如下: python collect_tgicl_result.py trinity_gt500 .fasta.singletons trinity_gt500 .fasta asm_* 运行完会生成一个tgicl_result.fasta文件,即TGICL聚类组装后最终的Unigene。 参考材料 [1] Geo Pertea1, Xiaoqiu Huang, Feng Liang, et al. TIGR Gene Indices clustering tools (TGICL):a software system for fast clustering of large EST datasets. Bioinformatics. 2003, 19(5):651-652 [2] http://bioinformation.cn/?p=563
我们都知道RNA-seq是通过NGS技术来检测基因表达量的测序方法。在衡量基因表达量方面,若是单纯以比对到参考基因的Reads个数(我们通常称之为Count值)来衡量基因的表达量,在统计上是一件相当不合理的事。今天就为大家介绍一下衡量基因表达量的RPKM和FPKM两种方法。 在随机抽样的情况下,序列较长的基因被抽到的概率本来就会比序列短的基因高,如此一来,序列长的基因永远会被认为表达量较高,而错估基因真正的表达量。在测序深度不同的情况下,测序深度更深的样品中,比对到每个基因的Read数量更多。 为排除因基因的长度、测序深度等因素造成的干扰,RPKM(Reads Per Kilobase Million)和FPKM(Fragments Per Kilobase Million)等方法就应运而生了。 RPKM (Reads Per Kilobase per Million)和 FPKM (Fragments Per Kilobase per Million) 首先需要解释FPKM和RPKM的原理是相似的,区别在于FPKM对应的是DNA片段,比如在一个Illumina的pair-end(双尾)RNA-seq中,一对(两个)reads对应是一个DNA片段。有了FPKM(RPKM)概念,我们就能比较:同一个样本中基因A和基因B的相对表达量;或者不同样本中,同一个基因的相对表达量。 具体的原因是:引入“每一千碱基(per kilobase)”的原因在于,不同的RNA可能有不同长度,长度越长,对应的reads就越多。当每个RNA都除以自身长度(以1000碱基为单位)时,就可以比较同一个样本中不同基因的相对表达量了。相似地,引入“每一百万reads”的原因是,不同的样本可能测序的深度不一样,深度越深,当然对应的reads就越多了。如果结果除以各自库的数量(以一百万reads为单位),那么我们就能很好地衡量两个不同样本中同一个基因的相对表达量。 RPKM RPKM是将Map到基因的Reads数除以Map到Genome的所有Read数(以Million为单位)与RNA的长度(以KB为单位)。 FPKM FPKM是将Map到基因的Fragments数除以Map到Genome的所有Read数(以Million为单位)与RNA的长度(以KB为单位)。 从公式上可以看出,方法是将Reads(Fragments)Count进行标准化,分别是对测序深度标准化(以Million为单位)和对基因长度标准化(以KB为单位),从而消除了因测序深度和基因长度不同对基因表达量的影响。
文章: A survey of best practices for RNA-seq data analysis 2016 ( http://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8 ) 文章:A comparison of methods for differential expression analysis of RNA-seq data 2013 ( https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-91 )
Identification of reference genes for qRT-PCR in human lung squamous-cell carcinoma by RNA-Seq Cheng Zhan, Yongxing Zhang, Jun Ma, Lin Wang, Wei Jiang, Yu Shi and Qun Wang Department of Thoracic Surgery, Zhongshan Hospital, Fudan University, Shanghai 200031, China Acta Biochim Biophys Sin 2014, 46: 330–337; doi: 10.1093/abbs/gmt153 Although the accuracy of quantitative real-time polymerase chain reaction (qRT-PCR) is highly dependent on the reliable reference genes, many commonly used reference genes are not stably expressed and as such are not suitable for quantification and normalization of qRT-PCR data. The aim of this study was to identify novel reliable reference genes in lung squamous-cell carcinoma. We used RNA sequencing (RNA-Seq) to survey the whole genome expression in 5 lung normal samples and 44 lung squamous-cell carcinoma samples. We evaluated the expression profiles of 15 commonly used reference genes and identified five additional candidate reference genes. To validate the RNA-Seq dataset, we used qRT-PCR to verify the expression levels of these 20 genes in a separate set of 100 pairs of normal lung tissue and lung squamous-cell carcinoma samples, and then analyzed these results using geNorm and NormFinder. With respect to 14 of the 15 common reference genes (B2M, GAPDH, GUSB, HMBS, HPRT1, IPO8, PGK1, POLR2A, PPIA, RPLP0, TBP, TFRC, UBC, and YWHAZ), the expression levels were either too low to be easily detected, or exhibited a high degree of variability either between lung normal and squamous-cell carcinoma samples, or even among samples of the same tissue type. In contrast, 1 of the 15 common reference genes (ACTB) and the 5 additional candidate reference genes (EEF1A1, FAU, RPS9, RPS11, and RPS14) were stably and constitutively expressed at high levels in all the samples tested. ACTB, EEF1A1, FAU, RPS9, RPS11, and RPS14 are ideal reference genes for qRT-PCR analysis of lung squamous-cell carcinoma, while 14 commonly used qRT-PCR reference genes are less appropriate in this context. Expression profiling of 15 common reference genes in RNA-Seq data 全文: http://abbs.oxfordjournals.org/content/46/4/330.full 相关论文: 1 Evaluation and validation of a robust single cell RNA -amplification protocol through transcriptional profiling of enriched lung cancer initiating cells 2 Reference Genes Selection and Normalization of Oxidative Stress Responsive Genes upon Different Temperature Stress Conditions in Hypericum perforatum L. 3 Connectivity Mapping for Candidate Therapeutics Identification Using Next Generation Sequencing RNA - Seq Data 4 Application of the whole-transcriptome shotgun sequencing approach to the study of Philadelphia-positive acute lymphoblastic leukemia 5 Identification of Pathogen Signatures in Prostate Cancer Using RNA - seq
FPKM, Fragments Kilobase of exon model per millon mapped reads, which can be used to indicate the expression (abundance) characteristics of genes. Now I will describe operation about obtaining interested gene FPKM value. 1.Software Download 1).fastq-dump: convert sra file to fastq file. website: http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software 2).bowtie:an ultrafast and memory efficient tool for aligning sequencing reads to long reference sequences. website: http://bowtie-bio.sourceforge.net/bowtie2/index.shtml 3).cufflinks:assembles transcripts, estimates their abundances, and tests for differential expression and regulation in RNA-Seq samples. website: http://cufflinks.cbcb.umd.edu/ 4).gffread: convert gff3 file to gtf file. website: http://cufflinks.cbcb.umd.edu/ (This program is included with cufflinks package) 2. Operation 1) Download genome.fa and genes.gff3 file from genome website; Download sra file from NCBI 2) Format conversion $ fastq-dump -I --split-files SRR123456789.sra # convert sra file to fastq file $ gffread -E genes.gff3 -o genes.gtf # convert gff3 file to gtf file 3) Index files $bowtie2-build genome.fa genome 4) Alignment $bowtie2 -x genome -1 SRR123456789_1.fastq -2 SRR123456789_2.fastq -S SRR123456789.sam $samtools view -bS SRR123456789.sam SRR123456789.bam $samtools sort SRR123456789.bam SRR123456789 5) FPKM values $cufflinks SRR123456789.bam -G genes.gtf -o result After these operations, we can extract FPKM values from genes.frkm_tracking file based on gene ID. Notes: If you find some bugs, please contact me.
List of RNA-Seq bioinformatics tools From Wikipedia, the free encyclopedia RNA-Seq ( RNA-Seq.ppt / RNA-Seq Guide ) is a revolutionary technique to perform transcriptome studies based on next-generation sequencing technologies. This technique is largely dependent on bioinformatics tools developed to support the different steps of the process. Here are listed some of the principal tools commonly employed and links to some related web resources. To follow an integrated guide to the analysis of RNA-seq data, please see - Next Generation Sequencing (NGS)/RNA , Hands-On Tutorial or RNA-Seq Workflow . Also, important links are SEQanswers , RNA-SeqList , RNA-SeqBlog , Biostar and bioscholar . Contents 1 Quality control and pre-processing data 1.1 Quality control and filtering data 1.2 Pre-processing data 2 Alignment Tools 2.1 Short (Unspliced) aligners 2.2 Spliced aligners 2.2.1 Aligners based on known splice junctions (annotation-guided aligners) 2.2.2 De novo Splice Aligners 2.2.2.1 De novo Splice Aligners that also use annotation optionally 2.2.2.2 Other Spliced Aligners 3 Quantitative analysis and Differential Expression 3.1 Multi-tool solutions 4 Workbench (analysis pipeline / integrated solutions) 4.1 Commercial Solutions 4.2 Open Source Solutions 5 Fusion genes/chimeras/translocation finders/structural variations 6 Copy Number Variations identification 7 RNA-Seq simulators 8 Transcriptome assemblers 8.1 Genome-Guided assemblers 8.2 Genome-Independent assemblers 9 Visualization tools 10 Functional, Network Pathway Analysis Tools 11 Further annotation tools for RNA-Seq data 12 RNA-Seq Databases 13 Webinars and Presentations 14 References Quality control and pre-processing data Quality control and filtering data Quality assessment is the first step of the bioinformatics pipeline of RNA-Seq. Often, is necessary to filter data, removing low quality sequences or bases (trimming), linkers, overrepresented sequences or noise to assure a coherent final result. clean_reads clean_reads . condetri condetri . cutadapt cutadapt removes adapter sequences from next-generation sequencing data (Illumina, SOLiD and 454). It is used especially when the read length of the sequencing machine is longer than the sequenced molecule, like the microRNA case. FastQC FastQC is a quality control tool for high-throughput sequence data ( Babraham Institute ) and is developed in Java . Import of data is possible from FastQ files, BAM or SAM format. This tool provides an overview to inform about problematic areas, summary graphs and tables to rapid assessment of data. Results are presented in HTML permanent reports. FastQC can be run as a stand alone application or it can be integrated into a larger pipeline solution. See also seqanswers/FastQC . FASTX FASTX Toolkit is a set of command line tools to manipulate reads in files FASTA or FASTQ format. These commands make possible preprocess the files before mapping with tools like Bowtie. Some of the tasks allowed are: conversion from FASTQ to FASTA format, information about statistics of quality, removing sequencing adapters, filtering and cutting sequences based on quality or conversion DNA / RNA . Flexbar Flexbar performs removal of adapter sequences, trimming and filtering features. FreClu FreClu improves overall alignment accuracy performing sequencing-error correction by trimming short reads, based on a clustering methodology. HTSeq HTSeq . htSeqTools htSeqTools is a Bioconductor package able to perform quality control, processing of data and visualization. htSeqTools makes possible visualize sample correlations, to remove over-amplification artifacts, to assess enrichment efficiency, to correct strand bias and visualize hits. PRINSEQ PRINSEQ generates statistics of your sequence data for sequence length, GC content, quality scores, n-plicates, complexity, tag sequences, poly-A/T tails, odds ratios. qrqc qrqc is a Bioconductor package to quick read quality control. RNA-SeQC RNA-SeQC is a tool with application in experiment design, process optimization and quality control before computational analysis. Essentially, provides three types of quality control: read counts (such as duplicate reads, mapped reads and mapped unique reads, rRNA reads, transcript-annotated reads, strand specificity), coverage (like mean coverage, mean coefficient of variation, 5’/3’ coverage, gaps in coverage, GC bias) and expression correlation (the tool provides RPKM-based estimation of expression levels). RNA-SeQC is implemented in Java and is not required installation, however can be run using the GenePattern web interface. The input could be one or more BAM files. HTML reports are generated as output. RSeQC RSeQC analyzes diverse aspects of RNA-Seq experiments: sequence quality, sequencing depth, strand specificity, GC bias, read distribution over the genome structure and coverage uniformity. The input can be SAM, BAM, FASTA, BED files or Chromosome size file (two-column, plain text file). Visualization can be performed by genome browsers like UCSC, IGB and IGV. However, R scripts can also be used to visualization. Sabre sabre . SAMStat SAMStat identifies problems and reports several statistics at different phases of the process. This tool evaluates unmapped, poorly and accurately mapped sequences independently to infer possible causes of poor mapping. Scythe scythe . SEECER seecer SEECER is a sequencing error correction algorithm for RNA-seq data sets. It takes the raw read sequences produced by a next generation sequencing platform like machines from Illumina or Roche. SEECER removes mismatch and indel errors from the raw reads and significantly improves downstream analysis of the data. Especially if the RNA-Seq data is used to produce a de novo transcriptome assembly, running SEECER can have tremendous impact on the quality of the assembly. Sickle Sickle . ShortRead ShortRead is a package provided in the R (programming language) / BioConductor environments and allows input, manipulation, quality assessment and output of next-generation sequencing data. This tool makes possible manipulation of data, such as filter solutions to remove reads based on predefined criteria. ShortRead could be complemented with several Bioconductor packages to further analysis and visualization solutions ( BioStrings , BSgenome , IRanges , and so on). See also seqanswers/ShortRead . SysCall SysCall is a classifier tool to identification and correction of systematic error in high-throughput sequence data. Trimmomatic Trimmomatic performs trimming for Illumina platforms and works with FASTQ reads (single or pair-ended). Some of the tasks executed are: cut adapters, cut bases in optional positions based on quality thresholds, cut reads to a specific length, converts quality scores to Phred-33/64. Pre-processing data Further tasks performed before alignment. DeconRNASeq DeconRNASeq is an R package for deconvolution of heterogeneous tissues based on mRNA-Seq data. FastQ Screen FastQ Screen screens FASTQ format sequences to a set of databases to confirm that the sequences contain what is expected (such as species content, adapters, vectors, etc). FLASH FLASH is a read pre-processing tool. FLASH combines paired-end reads which overlap and converts them to single long reads. IDCheck IDCheck . Alignment Tools After control assessment, the first step of RNA-Seq analysis involves alignment (RNA-Seq alignment) of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software and HTS Mappers . Short (Unspliced) aligners Short aligners are able to align continuous reads (not containing gaps result of splicing) to a genome of reference. Basically, there are two types: 1) based on the Burrows-Wheeler transform method such as Bowtie and BWA, and 2) based on Seed-extend methods, Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster, however some tools of the second group, despite the time spent tend to be more sensitive, generating more reads correctly aligned. See a comparative study of short aligners - comparative study . BFAST BFAST aligns short reads to reference sequences and presents particular sensitivity towards errors, SNPs, insertions and deletions. BFAST works with the Smith-Waterman algorithm. See also seqanwers/BFAST . Bowtie Bowtie is a fast short aligner using an algorithm based on the Burrows-Wheeler transform and the FM-index . Bowtie tolerates a small number of mismatches. See also seqanswers/Bowtie . Burrows-Wheeler Aligner (BWA) BWA implements two algorithms, mainly based on Burrows–Wheeler transform . The first algorithm is used with reads with low error rate (3%). The second algorithm was designed to handle more errors and implements a combined strategy: Burrows–Wheeler transform and Smith-Waterman method. BWA allows mismatches and small gaps (insertions and deletions). The output is presented in SAM format. See also seqanswers/BWA . Short Oligonucleotide Analysis Package (SOAP) SOAP . GNUMAP GNUMAP performs alignment using a probabilistic Needleman-Wunsch algorithm. This tool is able to handle alignment in repetitive regions of a genome without losing information. The output of the program was developed to make possible easy visualization using available software. Maq Maq first aligns reads to reference sequences and after performs a consensus stage. On the first stage performs only ungapped alignment and tolerates up to 3 mismatches. See also seqanswers/Maq . Mosaik Mosaik . Mosaik is able to align reads containing short gaps using Smith-Waterman algorithm , ideal to overcome SNPs, insertions and deletions. See also seqanswers/Mosaik . NovoAlign (commercial) NovoAlign is a short aligner to the Illumina platform based on Needleman-Wunsch algorithm. Novoalign tolerates up to 8 mismatches per read, and up to 7bp of indels. It is able to deal with bisulphite data. Output in SAM format. See also seqanswers/NovoAlign . RazerS RazerS . See also seqanswers/RazerS . SEAL SEAL uses a MapReduce model to produce distributed computing on clusters of computers. Seal uses BWA to perform alignment and Picard MarkDuplicates to detection and duplicate read removal. See also seqanswers/SEAL . SeqMap SeqMap . See also seqanswers/SeqMap . SHRiMP SHRiMP employs two techniques to align short reads. Firstly, the q-gram filtering technique based on multiple seeds identifies candidate regions. Secondly, these regions are investigated in detail using Smith-Waterman algorithm. See also seqanswers/SHRiMP . SMALT Smalt . Stampy Stampy combines the sensitivity of hash tables and the speed of BWA. Stampy is prepared to alignment of reads containing sequence variation like insertions and deletions. It is able to deal with reads up to 4500 bases and presents the output in SAM format. See also seqanswers/Stampy . ZOOM (commercial) ZOOM is a short aligner of the Illumina/Solexa 1G platform. ZOOM uses extended spaced seeds methodology building hash tables for the reads, and tolerates mismatches and insertions and deletions. See also seqanswers/ZOOM . Spliced aligners Many reads span exon-exon junctions and can not be aligned directly by Short aligners, thus specific aligners were necessary - Spliced aligners. Some Spliced aligners employ Short aligners to align firstly unspliced/continuous reads (exon-first approach), and after follow a different strategy to align the rest containing spliced regions - normally the reads are split into smaller segments and mapped independently. See also Methods to study splicing from high-throughput RNA Sequencing data and Methods to Study RNA-Seq (workflow) . Aligners based on known splice junctions (annotation-guided aligners) In this case the detection of splice junctions is based on data available in databases about known junctions. This type of tools cannot identify new splice junctions. Some of this data comes from other expression methods like expressed sequence tags (EST). Erange Erange is a tool to alignment and data quantification to mammalian transcriptomes. See also seqanswers/Erange . IsoformEx IsoformEx . MapAL MapAL . OSA OSA . RNA-MATE RNA-MATE is a computational pipeline for alignment of data from Applied Biosystems SOLID system. Provides the possibility of quality control and trimming of reads. The genome alignments are performed using mapreads and the splice junctions are identified based on a library of known exon-junction sequences. This tool allows visualization of alignments and tag counting. See also seqanswers/RNA-MATE . RUM RUM performs alignment based on a pipeline, being able to manipulate reads with splice junctions, using Bowtie and Blat. The flowchart starts doing alignment against a genome and a transcriptome database executed by Bowtie. The next step is to perform alignment of unmapped sequences to the genome of reference using BLAT. In the final step all alignments are merged to get the final alignment. The input files can be in FASTA or FASTQ format. The output is presented in RUM and SAM format. RNASEQR RNASEQR . See also seqanswers/RNASEQR . SAMMate SAMMate . See also seqanswers/SAMMate . SpliceSeq SpliceSeq . X-Mate X-Mate . De novo Splice Aligners De novo Splice aligners allow the detection of new Splice junctions without need to previous annotated information (some of these tools present annotation as a suplementar option). See also De novo Splice Aligners . ABMapper ABMapper . See also seqanswers/ABMapper . ContextMap ContextMap was developed to overcome some limitations of other mapping approaches, such as resolution of ambiguities. The central idea of this tool is to consider reads in gene expression context, improving this way alignment accuracy. ContextMap can be used as a stand-alone program and supported by mappers producing a SAM file in the output (e.g.: TopHat or MapSplice). In stand-alone mode aligns reads to a genome, to a transcriptome database or both. CRAC CRAC propose a novel way of analyzing reads that integrates genomic locations and local coverage, and detect candidate mutations, indels, splice or fusion junctions in each single read. Importantly, CRAC improves its predictive performance when supplied with e.g. 200 nt reads and should fit future needs of read analyses. GSNAP GSNAP . See also seqanswers/GSNAP . HMMSplicer HMMSplicer can identify canonical and non-canonical splice junctions in short-reads. Firstly, unspliced reads are removed with Bowtie. After that, the remaining reads are one at a time divided in half, then each part is seeded against a genome and the exon borders are determined based on the Hidden Markov Model . A quality score is assigned to each junction, useful to detect false positive rates. See also seqanswers/HMMSplicer . MapSplice MapSplice . See also seqanswers/MapSplice . OLego OLego . See also seqanswers/OLego . PALMapper PALMapper . See also seqanswers/PALMapper . Pass Pass aligns gapped, ungapped reads and also bisulfite sequencing data. It includes the possibility to filter data before alignment (remotion of adapters). Pass uses Needleman-Wunsch and Smith-Waterman algorithms, and performs alignment in 3 stages: scanning positions of seed sequences in the genome, testing the contiguous regions and finally refining the alignment. See also seqanswers/Pass . PASSion PASSion . PASTA PASTA . QPALMA QPALMA predicts splice junctions supported on machine learning algorithms. In this case the training set is a set of spliced reads with quality information and already known alignments. See also seqanswers/QPALMA . SeqSaw SeqSaw . SoapSplice SoapSplice . SpliceMap SpliceMap . See also seqanswers/SpliceMap . SplitSeek SplitSeek . See also seqanswers/SplitSeek . SuperSplat SuperSplat was developed to find all type of splice junctions. The algorithm splits each read in all possible two-chunk combinations in an iterative way, and alignment is tried to each chunck. Output in “Supersplat” format. See also seqanswers/SuperSplat . Subread Subread is a superfast, accurate and scalable read aligner. It uses the seed-and-vote mapping paradigm to determine the mapping location of the read by using its largest mappable region. It automatically decides whether the read should be globally mapped or locally mapped. For RNA-seq data, Subread should be used for the purpose of expression analysis. Subread is very powerful in mapping gDNA-seq reads as well. See also seqanswers/Subread . Subjunc Subjunc is a specialized version of Subread. It uses all mappable regions in an RNA-seq read to discover exons and exon-exon junctions. It uses the donor/receptor signals to find the exact splicing locations. Subjunc yields full alignments for every RNA-seq read including exon-spanning reads, in addition to the discovered exon-exon junctions. Subjunc should be used for the purpose of junction detection and genomic variation detection in RNA-seq data. See also seqanswers/Subjunc . TrueSight TrueSight . De novo Splice Aligners that also use annotation optionally GEM . MapNext MapNext . See also seqanswers/MapNext . STAR STAR is an ultrafast tool that employs “sequential maximum mappable seed search in uncompressed suffix arrays followed by seed clustering and stitching procedure”, detects canonical, non-canonical splices junctions and chimeric-fusion sequences. It is already adapted to align long reads (third-generation sequencing technologies) and can reach speeds of 45 million paired reads per hour per processor. See also seqanswers/STAR . TopHat TopHat is prepared to find de novo junctions. TopHat aligns reads in two steps. Firstly, unspliced reads are aligned with Bowtie. After, the aligned reads are assembled with Maq resulting islands of sequences. Secondly, the splice junctions are determined based on the initially unmapped reads and the possible canonical donor and acceptor sites within the island sequences. See also seqanswers/TopHat . Other Spliced Aligners G.Mo.R-Se G.Mo.R-Se is a method that uses RNA-Seq reads to build de novo gene models. Quantitative analysis and Differential Expression These tools calculate the abundance of each gene expressed in a RNA-Seq sample (see also Quantification models ). Some softwares are also designed to study the variability of genetic expression between samples (differential expression). Quantitative and differential studies are largely determined by the quality of reads alignment and accuracy of isoforms reconstruction. See a comparative study of differential expression methods and Which method should you use for normalization of rna-seq data? . Alexa-Seq Alexa-Seq is a pipeline that makes possible to perform gene expression analysis, transcript specific expression analysis, exon junction expression and quantitative alternative analysis. Allows wide alternative expression visualization, statistics and graphs. See also seqanswers/Alexa-Seq . ASC ASC . See also seqanswers/ASC . BaySeq BaySeq is a Bioconductor package to identify differential expression using next-generation sequencing data, via empirical Bayesian methods . There is an option of using the “snow” package for parallelisation of computer data processing, recommended when dealing with large data sets. See also seqanswers/BaySeq . BBSeq BBSeq . See also seqanswers/BBSeq . BitSeq BitSeq . CEDER CEDER . CPTRA CPTRA . casper casper is a Bioconductor package to quantify expression at the isoform level. It combines using informative data summaries, flexible estimation of experimental biases and statistical precision considerations which (reportedly) provide substantial reductions in estimation error. Cufflinks Cufflinks is appropriate to measure global de novo transcript isoform expression. It performs assembly of transcripts, estimation of abundances and determines differential expression and regulation in RNA-Seq samples. See also seqanswers/Cufflinks . DESeq DESeq is a Bioconductor package to perform differential gene expression analysis based on negative binomial distribution. See also seqanswers/DESeq . DEGSeq DEGSeq . See also seqanswers/DEGSeq . DEXSeq DEXSeq is Bioconductor package that finds differential differential exon usage based on RNA-Seq exon counts between samples. DEXSeq employs negative binomial distribuition, provides options to visualization and exploration of the results. DiffSplice DiffSplice is a method for differential expression detection and visualization, not dependent on gene annotations. This method is supported on identification of alternative spicing modules (ASMs) that diverge in the different isoforms. A non-parametric test is applied to each ASM to identify significant differential transcription with a measured false discovery rate. EBSeq EBSeq . EdgeR EdgeR is a R package for analysis of differential expression of data from DNA sequencing methods, like RNA-Seq, SAGE or ChIP-Seq data. edgeR employs statistical methods supported on negative binomial distribution as a model for count variability. See also seqanswers/EdgeR . eXpress eXpress performance includes transcript-level RNA-Seq quantification, allele-specific and haplotype analysis and can estimate transcript abundances of the multiple isoforms present in a gene. Although could be coupled directly with aligners (like Bowtie), eXpress can also be used with de novo assemblers and thus is not needed a reference genome to perform alignment. It runs on Linux, Mac and Windows. ERANGE ERANGE performs alignment, normalization and quantification of expressed genes. See also seqanswers/ERANGE . featureCounts featureCounts an efficient read summarization/quantification program. It is implemented in both SourceForge Subread package and Bioconductor Rsubread package . FDM FDM GPSeq GPSeq MATS MATS . MISO MISO quantifies the expression level of splice variants from RNA-Seq data and is able to recognize differentially regulated exons/isoforms across different samples. MISO uses a probabilistic method (Bayesian inference) to calculate the probability of the reads origin. See also seqanswers/MISO . MMSEQ MMSEQ is a pipeline for estimating isoform expression and allelic imbalance in diploid organisms based on RNA-Seq. The pipeline employs tools like Bowtie, TopHat, ArrayExpressHTS and SAMtools. Also, edgeR or DESeq to perform differential expression. See also seqanswers/MMSEQ . Myrna Myrna is a pipeline tool that runs in a cloud environment ( Elastic MapReduce ) or in a unique computer for estimating differential gene expression in RNA-Seq datasets. Bowtie is employed for short read alignment and R algorithms for interval calculations, normalization, and statistical processing. See also seqanswers/Myrna . NEUMA NEUMA is a tool to estimate RNA abundances using length normalization, based on uniquely aligned reads and mRNA isoform models. NEUMA uses known transcriptome data available in databases like RefSeq . NOISeq NOISeq . See also seqanswers/NOISeq . NSMAP NSMAP allows inference of isoforms as well estimation of expression levels, without annotated information. The exons are aligned and splice junctions are identified using TopHat. All the possible isoforms are computed by combination of the detected exons. RNAeXpress RNAeXpress Can be run with Java GUI or command line on Mac, Windows and Linux. Can be configured to perform read counting, feature detection or GTF comparison on mapped rnaseq data. rSeq rSeq RSEM RSEM . See also seqanswers/RSEM . rQuant rQuant is a web service ( Galaxy (computational biology) installation) that determines abundances of transcripts per gene locus, based on quadratic programming . rQuant is able to evaluate biases introduced by experimental conditions. A combination of tools is employed: PALMapper (reads alignment), mTiM and mGene (inference of new transcripts). Scotty Scotty Performs power analysis to estimate the number of replicates and depth of sequencing required to call differential expression. SpliceTrap SpliceTrap . SplicingCompass SplicingCompass . Multi-tool solutions DEB DEB is a web-interface/pipeline that permits to compare results of significantly expressed genes from different tools. Currently are available three algorithms: edgeR, DESeq and bayseq. Workbench (analysis pipeline / integrated solutions) See also an interesting blog 16 rna-seq tools you have to consider for your analysis pipeline . Commercial Solutions Avadis NGS Avadis NGS . CLC Genomics Workbench CLC Genomics Workbench DNASTAR DNASTAR GeneSpring GX GeneSpring GX geospiza geospiza Golden Helix Golden Helix NextGENe NextGENe Partek Partek Open Source Solutions ArrayExpressHTS ArrayExpressHTS (and ebi_ArrayExpressHTS ) is a BioConductor package that allows preprocessing, quality assessment and estimation of expression of RNA-Seq datasets. It can be run remotely at the European Bioinformatics Institute cloud or locally. The package makes use of several tools: ShortRead (quality control), Bowtie, TopHat or BWA (alignment to a reference genome), SAMtools format, Cufflinks or MMSEQ (expression estimation). See also seqanswers/ArrayExpressHTS . BiNGS!SL-seq . Chipster Chipster . easyRNASeq easyRNASeq . ExpressionPlot ExpressionPlot . FX FX . Galaxy : Galaxy is a general purpose workbench platform for computational biology. There are several publicly accessible Galaxy servers that support RNA-Seq tools and workflows, including NBIC's Andromeda , the CBIIT-Giga server, the Galaxy Project's public server , the GeneNetwork Galaxy server , the University of Oslo 's Genomic Hyperbrowser , URGI 's server (which supports S-MART), and many others. GENE-Counter GENE-Counter is a Perl pipeline for RNA-Seq differential gene expression analyses. Gene-counter performs alignments with CASHX, Bowtie, BWA or other SAM output aligner. Differential gene expression is run with three optional packages (NBPSeq, edgeR and DESeq) using negative binomial distribution methods. Results are stored in a MySQL database to make possible additional analyses. GenePattern GenePattern offers integrated solutions to RNA-Seq analysis ( Broad Institute ). GeneProf GeneProf : Freely accessible, easy to use analysis pipelines for RNA-seq and ChIP-seq experiments. MultiExperiment Viewer (MeV) MeV is suitable to perform analysis, data mining and visualization of large-scale genomic data. The MeV modules include a variety of algorithms to execute tasks like Clustering and Classification, Student's t-test , Gene Set Enrichment Analysis or Significance Analysis. MeV runs on Java . See also seqanswers/MeV . NGSUtils NGSUtils . RobiNA RobiNA provides a user graphical interface to deal with R/BioConductor packages. RobiNA provides a package that automatically installs all required external tools (R/Bioconductor frameworks and Bowtie ). This tool offers a diversity of quality control methods and the possibility to produce many tables and plots supplying detailed results for differential expression. Furthermore, the results can be visualized and manipulated with MapMan and PageMan . RobiNA runs on Java version 6. S-MART S-MART handles mapped RNA-Seq data, and performs essentially data manipulation (selection/exclusion of reads, clustering and differential expression analysis) and visualization (read information, distribution, comparison with epigenomic ChIP-Seq data). It can be run on any laptop by a person without computer background. A friendly graphycal user interface makes easy the operation of the tools. See also seqanswers/S-MART . Taverna Taverna . wapRNA wapRNA . Fusion genes/chimeras/translocation finders/structural variations Genome arrangements result of diseases like cancer can produce aberrant genetic modifications like fusions or translocations. Identification of these modifications play important role in carcinogenesis studies. BreakDancer BreakDancer . See also seqanswers/BreakDancer . ChimeraScan ChimeraScan . EBARDenovo EBARDenovo . FusionAnalyser FusionAnalyser . FusionCatcher FusionCatcher . FusionHunter FusionHunter identifies fusion transcripts without depending on already known annotations. It uses Bowtie as a first aligner and paired-end reads. See also seqanswers/FusionHunter . FusionMap FusionMap . FusionSeq FusionSeq . See also seqanswers/FusionSeq . SOAPFuse SOAPFuse . SOAPfusion Soapf usion . TopHat-Fusion TopHat-Fusion is based on TopHat version and was developed to handle reads resulting from fusion genes. It does not require previous data about known genes and uses Bowtie to align continuous reads. See also seqanswers/TopHat-Fusion . ViralFusionSeq ViralFusionSeq is high-throughput sequencing (HTS) tool for discovering viral integration events and reconstruct fusion transcripts at single-base resolution. See also hkbic/VFS and SEQWiki/VFS . DeFuse DeFuse . Copy Number Variations identification CNVseq CNVseq detects copy number variations supported on a statistical model derived from array-comparative genomic hybridization . Sequences alignment are performed by BLAT, calculations are executed by R modules and is fully automated using Perl. See also seqanswers/CNVseq . RNA-Seq simulators These Simulators generate in silico reads and are a useful tool to compare and test the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make possible to analyse and model RNA-Seq protocols. BEERS Simulator BEERS is formatted to mouse or human data, and paired-end reads sequenced on Illumina platform. Beers generates reads starting from a pool of gene models coming from different published annotation origins. Some genes are chosen randomly and afterwards are introduced deliberately errors (like indels, base changes and low quality tails), followed by construction of novel splice junctions. dwgsim dwgsim . Flux simulator Flux Simulator implements a computer pipeline simulation to mimic a RNA-Seq experiment. All component steps that influence RNA-Seq are taken into account (reverse transcription, fragmentation, adapter ligation, PCR amplification, gel segregation and sequencing) in the simulation. These steps present experimental attributes that can be measured, and the approximate experimental biases are captured. Flux Simulator allows joining each of these steps as modules to analyse different type of protocols. See also seqanswers/Flux . RSEM Read Simulator rsem-simulate-reads . RNASeqReadSimulator RNASeqReadSimulator contains a set of simple Python scripts, command line driven. It generates random expression levels of transcripts (single or paired-end), equally simulates reads with a specific positional bias pattern and generates random errors from sequencing platforms. RNA Seq Simulator RNA Seq Simulator . Transcriptome assemblers The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs. There are two types of approaches to assemble transcriptomes. Genome-guided methods use a reference genome (if possible a finished and high quality genome) as a template to align and assembling reads into transcripts. Genome-independent methods does not require a reference genome and are normally used when a genome is not available. In this case reads are assembled directly in transcripts. Genome-Guided assemblers Cufflinks Cufflinks . IsoInfer IsoInfer . IsoLasso IsoLasso . RNAeXpress RNAeXpress . Scripture Scripture . See also seqanswers/Scripture . Genome-Independent assemblers KISSPLICE KISSPLICE . Oases Oases . See also seqanswers/Oases . Rnnotator . SOAPdenovo SOAPdenovo . See also seqanswers/SOAPdenovo . Scaffolding Translation Mapping (STM) . Trans-ABySS Trans-AByss . See also seqanswers/Trans-ABySS . Trinity Trinity . See also seqanswers/Trinity . Velvet Velvet (algorithm) . Velvet(EMBL-EBI) . See also seqanswers/Velvet . Visualization tools Artemis Artemis . Apollo Apollo . EagleView EagleView . GBrowse GBrowse . Integrated Genome Browser IGB . Integrative Genomics Viewer (IGV) IGV . GenomeView genomeview . MapView MapView . Tablet Tablet Tbrowse- HTML5 Transcriptome Browser Tbrowse . Savant Savant Samscope Samscope . SeqMonk SeqMonk . See also seqanswers/SeqMonk . Vespa Vespa . Functional, Network Pathway Analysis Tools Ingenuity Systems (commercial) iReport IPA : Ingenuity’s IPA and iReport applications enable you to upload, analyze, and visualize RNA-Seq datasets, eliminating the obstacles between data and biological insight. Both IPA and iReport support identification, analysis and interpretation of differentially expressed isoforms between condition and control samples, and support interpretation and assessment of expression changes in the context of biological processes, disease and cellular phenotypes, and molecular interactions. Ingenuity iReport supports the upload of native Cuffdiff file format as well as gene expression lists. IPA supports the upload of gene expression lists. Gene Set Association Analysis for RNA-Seq ( GSAA-Seq ) : GSAA-Seq is a computational method that assesses the differential expression of a pathway/gene set between two biological states based on RNA-Seq data. Further annotation tools for RNA-Seq data seq2HLA seq2HLA is an annotation tool for obtaining an individual's HLA class I and II type and expression using standard NGS RNA-Seq data in fastq format. It comprises mapping RNA-Seq reads against a reference database of HLA alleles using bowtie , determining and reporting HLA type, confidence score and locus-specific expression level. This tool is developed in Python and R . It is available as console tool or Galaxy module. See also seqanswers/seq2HLA . HLAminer HLAminer is a computational method for identifying HLA alleles directly from whole genome, exome and transcriptome shotgun sequence datasets. HLA allele predictions are derived by targeted assembly of shotgun sequence data and comparison to a database of reference allele sequences. This tool is developed in perl and it is available as console tool. pasa pasa . RNA-Seq Databases queryable-rna-seq-database queryable-rna-seq-database . RNA-Seq Atlas RNA-Seq Atlas . SRA SRA . Webinars and Presentations RNASeq-Blog Presentations RNA-Seq Workshop Documentation (UC Davis University) VIDEO: Strategies for Identifying Biologically Compelling Genes from Breast Cancer Subtype RNA-Seq Profiles with Accompanying Analysis Princeton Workshop Youtube/RNA-Seq NGS Leaders COFACTOR genomics References ^ Wang Z, Gerstein M, Snyder M. (January 2009). RNA-Seq: a revolutionary tool for transcriptomics . Nature Reviews Genetics 10 (1): 57–63. doi : 10.1038/nrg2484 . PMC 2949280 . PMID 19015660 . ^ a b Yang Liao, Gordon K Smyth and Wei Shi (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote . Nucleic Acids Research 41 . doi : 10.1093/nar/gkt214 . PMID 23558742 . ^ http://bioinformatics.oxfordjournals.org/content/29/1/15.full ^ Cole Trapnell, Lior Pachter and Steven Salzberg (2009). TopHat: discovering splice junctions with RNA-Seq . Bioinformatics 25 (9): 1105–1111. doi : 10.1093/bioinformatics/btp120 . PMC 2672628 . PMID 19289445 . ^ Cole Trapnell, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold and Lior Pachter (2010). Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms . Nature Biotechnology 28 (5): 511–515. doi : 10.1038/nbt.1621 . PMC 3146043 . PMID 20436464 . ^ Zerbino DR, Birney E (2008). Velvet: Algorithms for de novo short read assembly using de Bruijn graphs . Genome Research 18 (5): 821–829. doi : 10.1101/gr.074492.107 . PMC 2336801 . PMID 18349386 . zz: http://en.wikipedia.org/wiki/List_of_RNA-Seq_bioinformatics_tools