ScienceNet.cn (科学网)


Tag: Sam


Related blog posts

SAM file tip: how forward- and reverse-strand reads are stored in the SAM sequence column
hayidahubei 2020-3-10 06:16
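Column 2 of a SAM file is the FLAG. For single-end reads the common values are 0 (forward strand), 16 (reverse strand), and 4 (unmapped).

If a read maps to the forward strand, column 10 of the SAM file shows the read's sequence as sequenced. If a read maps to the reverse strand, column 10 shows the reverse complement of the read.

Here are two lines from a SAM file (first 10 columns):

K00315:238:HF3Y3BBXY:2:1211:29031:44728 0 chr10 360997 42 50M * 0 0 TGGTTGGAAGCTGGGGCCCCGGGGCAGGGGACGTCTGCTAAGCTGCGTAT
K00315:238:HF3Y3BBXY:2:2215:31142:22221 16 chr10 361681 40 50M * 0 0 CCATTATAAATCTTCATACTACAGAAACAGCCTGGGCAGAGCAACTGCCT

In the first line, column 2 is 0, so this read maps to the forward strand of the genome. Searching the original fastq directly for "TGGTTGGAAGCTGGGGCCCCGGGGCAGGGGACGTCTGCTAAGCTGCGTAT" finds the read.

In the second line, column 2 is 16, so this read maps to the reverse strand of the genome. Searching the original fastq directly for "CCATTATAAATCTTCATACTACAGAAACAGCCTGGGCAGAGCAACTGCCT" does not find the read. Its reverse complement, "AGGCAGTTGCTCTGCCCAGGCTGTTTCTGTAGTATGAAGATTTATAATGG", can be found in the original fastq file.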
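The rule above can be checked with a few lines of Python. This is only a minimal sketch: the function names `reverse_complement` and `original_read_sequence` are made up for illustration, and it assumes the reverse-strand bit of FLAG (value 16) is the only thing that decides whether column 10 was reverse-complemented.

```python
# Translation table for DNA bases; N stays N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read_sequence(flag, seq):
    """Recover the sequence as it appears in the fastq file.

    flag: integer from SAM column 2; seq: string from SAM column 10.
    If bit 16 (reverse strand) is set, column 10 holds the reverse
    complement of the sequenced read, so we flip it back.
    """
    return reverse_complement(seq) if flag & 16 else seq


if __name__ == "__main__":
    rev = "CCATTATAAATCTTCATACTACAGAAACAGCCTGGGCAGAGCAACTGCCT"
    print(original_read_sequence(16, rev))
```

Testing the bit with `flag & 16` rather than `flag == 16` means the same function also works for paired-end data, where other bits of FLAG are set as well.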
Category: RNA-seq processing | 7460 views | 0 comments
Qualimap2: an advanced QC tool for multi-sample high-throughput data
Popularity 1 luria 2017-8-10 12:59
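Today I recommend a tool called Qualimap2. It summarizes read-mapping results and, for RNA-seq data, can even compute counts and run some simple downstream analyses. The annotation input can be GFF, GTF, or BED, which makes it more convenient than comparable tools.

==========================================================================
Installing Qualimap2:

Download Qualimap2 from the official site, http://qualimap.bioinfo.cipf.es/doc_html/intro.html (the latest version at the time of writing is v2.2.1). It is written in Java, so it runs directly. It does call several R packages, though, which need to be installed first:

    R    # enter the R environment
    install.packages("optparse")

In the mirror list that pops up, choose China (Beijing); the package then installs directly.

    source("http://bioconductor.org/biocLite.R")
    biocLite(c("NOISeq", "Repitools", "Rsamtools", "GenomicFeatures", "rtracklayer"))

Note: you can also use the bundled installer, but the automatic installation usually fails with errors, so I do not recommend it. The automatic installation commands are:

    cd qualimap_v2.2.1
    Rscript scripts/installDependencies.r

==========================================================================
Using Qualimap2:

There is not much to say about the GUI version: just run qualimap. Because the data sets are usually large (and numerous), they are normally processed in batches on Linux, so this post focuses on the command line.

1. The bamqc method

QC of a BAM file:

    ./qualimap bamqc -bam xx.srt.bam -outdir bamqc_result -outformat PDF:HTML

# -bam gives the path to the BAM file, -outdir the output directory, and -outformat the report format (PDF, HTML, or PDF:HTML). PDF:HTML is recommended here; it produces both formats.
# Different aligners handle reads differently (for example, whether unmapped reads are kept and how multi-mapped reads are treated), so some of the figures Qualimap2 reports can be wrong. For example, I mapped reads with hisat2, and in Qualimap2's A/T/C/G/N statistics the N fraction exceeded 100%. That is clearly impossible and is probably a bug in the algorithm. The vast majority of the results are still trustworthy, though. Only the most important ones are listed here; see the generated HTML report for the rest (the same applies below):

[Figure: summary table of alignment results]
[Figure: insert size histogram]

The histogram shows that the library insert size is about 350 bp.

If an annotation file for the reference is available:

    ./qualimap bamqc -bam xx.srt.bam -gff yy.gtf -oc count.matrix -outdir bamqc -outformat PDF:HTML

# -bam gives the path to the BAM file; -gff gives the reference genome annotation, which can be GFF, GTF, or BED; -oc writes the depth of every position covered by reads to the named file (uncovered positions are not written); -outdir gives the output directory; -outformat gives the report format.

On top of the items above, the report gains a summary table for the annotated regions:

[Figure: summary table for the annotated regions]

2. The rnaseq method

QC of RNA alignment data. A BAM file and an annotation file are required; they are used to produce plots such as the gene expression uniformity plot.

    ./qualimap rnaseq -bam xx.srt.bam -gtf yy.gtf -oc count.matrix -outdir rnaseq_result -pe -outformat PDF:HTML

# The other options are the same as for bamqc; the only addition is -pe, which indicates paired-end sequencing.
# The command looks similar, but the output differs considerably from that of bamqc (so rnaseq and bamqc can usefully be run together). It mainly includes:

[Figure: summary table of basic alignment statistics]
[Figure: annotation of the aligned regions, with the corresponding plot]
[Figure: 5' and 3' bias]
[Figure: proportion of novel transcripts]

3. The multi-bamqc method

Qualimap2 can also compute statistics for several BAM files in one batch. First build a file mapfile1.txt:

[Screenshot: example mapfile1.txt]

Columns are separated by tabs. Column 1 is the sample name, column 2 is the location of the sample's BAM file (a bamqc result directory can be given instead, but using the BAM file directly is recommended), and column 3 is the group. If the samples are independent, every group is different; if samples belong to one group, the label itself does not matter as long as it is identical within the group.

    ./qualimap multi-bamqc -r -d mapfile1.txt -gff xx.gtf -outdir multi_bamqc -outformat PDF:HTML

# multi-bamqc can use the results of the bamqc method, or take BAM files directly and run the whole pipeline. If the inputs are BAM files, the -r option is required. -d gives the path to mapfile1.txt. The other options are the same as for bamqc.
# The generated report includes the following statistics (see the HTML report for those not listed here):

[Figure: overall statistics table for all samples]
[Figure: mapping results by species]
[Figure: PCA plot]
[Figure: duplication rate of alignments]
[Figure: insert size histogram for each sample]

In the HTML output, every figure can be viewed and saved through the right-click menu.

4. The comp-counts method

Qualimap2 can generate a counts matrix. The command is:

    awk '{print $1,$2}' mapfile1.txt | while read a b; do ./qualimap comp-counts -bam $b -gtf xx.gtf -out $a.count -pe; done

# This produces a series of count files. Viewing one with head gives:

[Screenshot: head of a count file]

5. The counts method

Before using the counts method you need a sample information file, here called mapfile2.txt. According to the software documentation, up to the current latest version only two groups can be compared; with more groups it reports an error. The format of mapfile2.txt is:

[Screenshot: example mapfile2.txt]

Column 1 is the sample name. Column 2 is the group (the label itself does not matter as long as it distinguishes the groups). Column 3 is the path to the counts file; this can be a counts file produced by the comp-counts step above (files from other software also work, but using Qualimap2's own counts keeps the workflow smooth). Column 4 gives the column of the counts file that holds the count data.

    ./qualimap counts -c -d mapfile2.txt -outdir counts_result -outformat PDF:HTML

# -c requests a comparison between the two groups; -d gives the path to mapfile2.txt; the other options are the same as for bamqc.
# When it finishes, several reports are written to the counts_result directory. ComparisonReport.html holds the per-group analysis:

[Figure: box plot of the counts distribution in the two groups]
[Figure: sensitivity analysis for low expression]

GlobalReport.html plots all samples in the same figure, without grouping. The report contains an Options table showing a Counts threshold of 5, which means the program's default counts threshold is 5. This value can be changed with the -k option.

[Figure: density plot of counts; the x-axis is log2(counts)]
[Figure: scatter plots of between-sample correlation]
[Figure: sequencing saturation analysis for each sample]
[Figure: box plot of expression for each sample]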
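For anyone who prefers to script the batch step, here is a rough Python sketch of the same idea as the awk loop in step 4: write the tab-separated mapfile1.txt, then build one comp-counts command per sample. The sample names, BAM paths, and helper names (`write_mapfile`, `comp_counts_commands`) are hypothetical, and the sketch only builds the command strings without running Qualimap2. The only options used are the ones shown in the post (-bam, -gtf, -out, -pe).

```python
import shlex


def write_mapfile(samples, path):
    """Write mapfile1.txt: sample name, BAM path, group, separated by tabs."""
    with open(path, "w") as handle:
        for name, bam, group in samples:
            handle.write(f"{name}\t{bam}\t{group}\n")


def comp_counts_commands(mapfile, gtf, qualimap="./qualimap"):
    """Build one 'qualimap comp-counts' command line per sample in the mapfile."""
    commands = []
    with open(mapfile) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            name, bam = line.rstrip("\n").split("\t")[:2]
            args = [qualimap, "comp-counts", "-bam", bam,
                    "-gtf", gtf, "-out", f"{name}.count", "-pe"]
            # Quote each argument so paths with spaces stay intact.
            commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    samples = [("s1", "s1.srt.bam", "control"), ("s2", "s2.srt.bam", "treated")]
    write_mapfile(samples, "mapfile1.txt")
    for command in comp_counts_commands("mapfile1.txt", "xx.gtf"):
        print(command)
```

The printed lines can be reviewed and then pasted into a shell, or handed to a job scheduler.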
Category: RNA-seq | 17083 views | 2 comments
SAM format
liujd 2017-4-16 20:22
Category: Bioinformatics | 9 views | 0 comments
Parsing of CIGAR and MD tag in SAM/BAM format
ljxue 2015-1-1 01:05
Source code:
cigar: http://cpansearch.perl.org/src/MNSMAR/GenOO-1.4.5/lib/GenOO/Data/File/SAM/Cigar.pm
MDZ: http://cpansearch.perl.org/src/MNSMAR/GenOO-1.4.5/lib/GenOO/Data/File/SAM/MDZ.pm
Cigar and MDZ: http://cpansearch.perl.org/src/MNSMAR/GenOO-1.4.5/lib/GenOO/Data/File/SAM/CigarAndMDZ.pm
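To give a feel for what such parsers do, here is a small Python sketch. It is not a port of the GenOO Perl modules linked above; the function names are invented for illustration, and it relies only on the general layout of the two fields: CIGAR is a run of length-plus-operation pairs, and MD alternates match counts with mismatched reference bases and `^`-prefixed deleted bases.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if cigar == "*":
        return []
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to make sure nothing was silently skipped.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return ops


def reference_length(ops):
    """Bases consumed on the reference (M, D, N, =, X)."""
    return sum(n for n, op in ops if op in "MDN=X")


def query_length(ops):
    """Bases consumed on the read (M, I, S, =, X)."""
    return sum(n for n, op in ops if op in "MIS=X")


def parse_md(md):
    """Split an MD value into match, mismatch, and deletion events."""
    events = []
    for m in MD_RE.finditer(md):
        if m.group(1) is not None:
            events.append(("match", int(m.group(1))))
        elif m.group(2) is not None:
            events.append(("deletion", m.group(2)))
        else:
            events.append(("mismatch", m.group(3)))
    return events


if __name__ == "__main__":
    ops = parse_cigar("3S10M2I5M1D4M")
    print(ops, reference_length(ops), query_length(ops))
    print(parse_md("10A5^AC6"))
```

The split into `reference_length` and `query_length` is the part that matters most in practice: the first gives the span of the alignment on the genome, the second should equal the length of the sequence in column 10 (hard clips excluded).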
Category: Bioinformatics | 5448 views | 0 comments
[Repost] Sam Roweis: Local Linear Embedding, nonlinear dimensionality reduction
hestendelin 2013-10-27 14:34
Original: http://www.cs.nyu.edu/csweb/People/samroweis.html

Sam Roweis (1972-2010)
Scientist and Engineer

The Department of Computer Science, the Courant Institute, and New York University mourn the untimely death of Professor Sam T. Roweis, on January 12, 2010. Sam was a brilliant scientist and engineer whose work deeply influenced the fields of artificial intelligence, machine learning, applied mathematics, neural computation, and observational science. He was also a strong advocate for the use of machine learning and computational statistics for scientific data analysis and discovery.

Sam T. Roweis was born on April 27, 1972. He graduated from secondary school as valedictorian of the University of Toronto Schools in 1990, and obtained a bachelor's degree with honours from the University of Toronto Engineering Science Program four years later. His first exposure to AI and neural computation occurred when, as an exceptional undergraduate, he took the graduate-level Neural Network course taught by Geoffrey Hinton. Here Sam discovered what would become his lifelong interest: unlocking the mysteries of intelligence; motivating all his work was a dream to understand human intelligence, and to build intelligent machines.

In 1994 he joined the Computation and Neural Systems PhD program at the California Institute of Technology, working under the supervision of John J. Hopfield. Sam made several contributions to the then-nascent field of molecular and DNA computing. With his contemporary Erik Winfree and others, he made a proposal for a sticker-based model of computation. But the main topic of his thesis was speech recognition, time-series analysis, and dynamical systems modeling. A central theme of his research was the systematic use of probabilistic frameworks to formulate and analyze learning algorithms. He realized that non-linear dynamical systems could be learned using the Expectation-Maximization (EM) algorithm. He proposed a variation of the well-established Hidden Markov Model (HMM) method for speech recognition, and he showed how a new form of Independent Component Analysis (ICA) could be used to separate multiple audio sources from a single microphone signal. He also realized that Principal Components Analysis (PCA) could be re-interpreted as the limit of a probabilistic model. His PhD research culminated with the publication of a landmark 1999 article, co-authored with Zoubin Ghahramani, that demonstrated that HMM, ICA, PCA, and Kalman Filters can all be seen as variations on a single linear Gaussian model.

After earning his PhD in 1999, Sam took a postdoctoral position in London with the Gatsby Computational Neuroscience Unit founded by Geoff Hinton. Sam's enthusiasm and creativity played an important role in making the Gatsby Unit one of the top labs in computational neuroscience. At Gatsby, Sam started an incredibly fruitful long-distance collaboration with Lawrence Saul (then at AT&T Labs in New Jersey), which led to the Locally Linear Embedding algorithm (LLE). The LLE paper, published in Science in 2000, revolutionized the field of dimensionality reduction, and gathered over 2700 citations in less than 10 years. It spurred an entire new sub-field of machine learning, called manifold learning, and gathered a considerable amount of interest from other technical fields, including applied mathematics. With LLE, Sam and Lawrence taught us to think globally and fit locally: Given points in a high-dimensional space, local geometric relations among groups of nearby data points capture both local and global structure in the whole data set. This permits organization, visualization, and search for large, complex data collections. The method has had numerous applications in data visualization for biology, neuroscience, and the social sciences.

After his postdoc, some time at MIT, and a stint with the startup company WhizBang! Labs, Sam took a faculty job at the University of Toronto, to which Geoff Hinton had returned. In making this choice Sam rejected several extremely prestigious offers for the unparalleled intellectual atmosphere he found at Toronto surrounding his mentor, Geoff. In this period, two new unsupervised methods he developed were Stochastic Neighborhood Embedding (SNE) and Neighborhood Component Analysis (NCA). Both methods use the idea of learning a function that maps datapoints into a space in which semantically similar objects are nearby, while semantically dissimilar objects are far apart. SNE has become a popular method for visualizing and organizing high-dimensional data, while NCA has spurred a resurgence of interest in methods that learn similarity metrics from data. Sam published a set of papers on speech and signal analysis, particularly using factorial HMM and hierarchical models. He was appointed to a Canadian Research Chair in statistical machine learning, and received a Sloan research fellowship in 2004.

In 2005 Sam spent a semester at MIT. Capitalizing on his work on blind source separation, he co-authored a landmark 2006 SIGGRAPH paper with Rob Fergus and others on removing camera shake from a single photograph. It was during his stay in Cambridge, MA that he met his wife, Meredith Goldwasser.

While at MIT and upon his return to Toronto, he focused on using machine learning and statistical methods to contribute to other sciences, such as astronomy and biology. He started an extremely fruitful collaboration with NYU astronomer and secondary-school friend David W. Hogg. Their most visible success together was a kind of search engine for the sky, called Astrometry.net. The system can take any picture of the sky from any source, and instantly identify the location, orientation, and magnification of the image, as well as name each object (star, galaxy, nebula) it contains. Sam and David introduced astronomers to a number of large-scale statistical methods that enabled increasingly automated and precise data analysis. One of their methods can even estimate the year at which an image was taken by measuring tiny variations in stellar positions. In 2006 he was named a fellow of the Canadian Institute for Advanced Research (CIfAR) and received tenure at Toronto.

Sam was not just a scientist, however; he was also an engineer. When Meredith took a job at Genentech in San Francisco, Sam took an opportunity to have a more direct impact on the world by joining Google's research labs in San Francisco and Mountain View in 2007. He was fond of saying that Google's search engine is one of the closest things we have to an intelligent computer.

In the summer of 2008, Sam and Meredith's twin daughters, Aya and Orli, were born in San Francisco. They were born very prematurely and had to be kept in an intensive care unit for many weeks. Sam took an extended paternity leave from Google to take care of the twins.

Sam's stay at Google, and the success of his computational astronomy work, had renewed his interest in academic research. He decided to join the Computer Science Department at NYU's Courant Institute as an Associate Professor, and moved the family to New York City in September 2009. At NYU, his collaboration with David Hogg redoubled, as did his ongoing collaborations with Rob Fergus and Yann LeCun. His passing leaves many open threads, and many projects unfinished; at the time of his death he was working on beautiful, simple, but game-changing ideas for astronomical data analysis and remote sensing.

Sam had a singular gift: to him, any complex concept was naturally reduced to a simple set of ideas, each of which had clear analogies in other (often very distant) realms. This gift allowed him to explain the key idea behind anything in just a few minutes. Combined with contagious enthusiasm, this made him an unusually gifted teacher and speaker. His talks and discussions were clear and highly entertaining. His tutorial lectures on graphical models and metric learning, available on video at videolectures.net, have been viewed over 25,000 times. He would often begin group meetings by giving a puzzle, the solution of which was always beautiful, enlightening, or hilarious.

Many members of the research community became friends with Sam, because of his warm and friendly personality, his communicative smile, and his natural inclination to engagement and enthusiasm. Sam inspired many students to pursue a career in research, and to focus their research on machine learning and artificial intelligence.

Already in his short time at NYU, Sam had become a key member of the computer science department, thanks to his broad interests, his clear-sightedness, his sense of humor, his warmth and his infectious enthusiasm. He was also a loving and devoted father to the twins and husband to Meredith.

Sam is greatly mourned by his colleagues and students at NYU, who extend their sympathy to his many friends in the broader research community, especially at the University of Toronto, the Gatsby Neuroscience Unit, and Google Research. Most of all, we express our deepest sympathy to his wife Meredith, his twin baby daughters Aya and Orli, and his father Shoukry.

The Sam Roweis Memorial Blog is collecting memories of Sam by friends and colleagues. The Sam Roweis Memorial Fund was set up by the NIPS Foundation to support the care and well-being of his family.

Sam Roweis's home page and publications
Category: Reposts | 3406 views | 0 comments
Spatial autocorrelation and how to use the SAM software
Popularity 2 mengchanghe 2013-9-20 15:17
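Spatial autocorrelation is the potential mutual dependence among observations of a variable within the same region. For example, a biodiversity index may be high at a site simply because the values around it are high. This effect should be removed where possible, although some recent papers suggest that removing it or not seems to make little difference...

Back to the point. There are many ways to remove it. R has packages for this, but preparing the data often takes some effort, and it is not as quick as the SAM software (http://www.ecoevol.ufg.br/sam/), so this post describes how to do it with SAM.

1. Save the data in the required layout. The first row holds the names of the explanatory variables, response variables, and so on, followed by the data rows. Pay particular attention to including longitude and latitude (or plot numbers arranged as a matrix).

2. First test whether spatial autocorrelation has an effect (this step must be done for every response variable). Click "Modelling" ---- "Spatial Eigenvector mapping", select the response variable, for example "Species richness", check that Longitudinal coordinate (X) and the other settings are correct, then click "Calculate" ---- "Compute" below. If the message "Problems found during ......" appears, there is no spatial autocorrelation effect and ordinary model fitting is enough. If no message appears, an effect is present and you can go on to "spatial filters". At this point the software has automatically selected the spatial filters that may matter. Click "save" --- "OK" to store your spatial filters; they are needed when building the model.

Note: spatial autocorrelation can also be tested directly with "structure" --- "Moran's I correlogram", judging from the values in the result. I recommend the method above, though, because its results can be used directly.

3. The next step is model selection: "Modelling" ---- "Model Selection and Multi-Model Inference". Click the response variable, for example "species", in the upper panel of the left box, and move all candidate explanatory variables from the lower panel of the left box into "Predictor variable(s) to be selected". Put the spatial filters saved in step 2 into "Predictor variable(s) present in all models" (so that spatial autocorrelation is accounted for and removed in every model), then click "Compute".

Note: with large data sets a high-spec computer is recommended; the computation is much faster.

4. This yields many models. Choose the best one, note which explanatory variables it contains, and go on to the next step: "Linear regression analysis" ----- select the response variable, for example "species", in the upper panel of the left box, and add the explanatory variables you want to the box under 1st Predictor Set. If your data fall into several broad parts, such as environmental variables and population pressure, you can put the variables in the boxes under 1st Predictor Set, 2nd Predictor Set, and 3rd Predictor Set respectively, which lets you test the variance explained by each part and the variance they explain jointly. Put the saved spatial filters under the last Predictor Set. Click "Compute" and you are done.

5. Viewing the results: "Analytical Results" shows the total variance explained, and "Partial Regression" shows the variance explained by each part and by their interactions.

6. Repeat the procedure for each of the other response variables.

Acknowledgement: I learned this method from Professor Ferry Slik.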
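For readers who want to see what the Moran's I statistic mentioned in the note measures, here is a minimal pure-Python sketch. It is not what the SAM software runs internally. It assumes a hypothetical weight matrix in which `weights[i][j]` is 1 when sites i and j are neighbors and 0 otherwise, and it uses the standard definition I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where W is the sum of all weights.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]

    total_weight = sum(sum(row) for row in weights)
    denom = sum(d * d for d in dev)
    if total_weight == 0 or denom == 0:
        raise ValueError("weights and values must not be all zero or constant")

    # Cross-products of deviations for every weighted pair of sites.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    return (n / total_weight) * (cross / denom)


if __name__ == "__main__":
    # Four sites along a transect; each site neighbors the next one.
    chain = [
        [0, 1, 0, 0],
        [1, 0, 1, 0],
        [0, 1, 0, 1],
        [0, 0, 1, 0],
    ]
    print(morans_i([1, 2, 3, 4], chain))    # smooth gradient
    print(morans_i([1, -1, 1, -1], chain))  # alternating values
```

Values above zero mean that neighboring sites tend to be similar, values below zero mean they tend to differ. A correlogram repeats this calculation for neighbors at increasing distance classes.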
Category: Software odds and ends | 8267 views | 2 comments
SAM file format
liujd 2013-6-30 10:13
Category: Bioinformatics | 0 comments
Rough estimate of fragment size from a bwa-aligned SAM file
kangyu 2013-4-16 16:56
Plain Linux commands are enough; this is fairly fast, but not very rigorous.
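The post does not show its command, so the following is only one possible way to get such a rough estimate, written in Python. It assumes paired-end data, reads the template length from SAM column 9 (TLEN), keeps only properly paired reads (FLAG bit 2) with a positive TLEN so each pair is counted once, and reports the median. The function name and the `max_size` cutoff are invented for illustration.

```python
import statistics


def estimate_fragment_size(sam_lines, max_size=10000):
    """Return the median positive TLEN of properly paired reads, or None."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        # Bit 2: read mapped in a proper pair. Positive TLEN: leftmost mate,
        # so each fragment contributes exactly one value.
        if flag & 2 and 0 < tlen <= max_size:
            sizes.append(tlen)
    return statistics.median(sizes) if sizes else None


if __name__ == "__main__":
    import sys

    # Example: samtools view aln.bam | python this_script.py
    print(estimate_fragment_size(sys.stdin))
```

The median is used instead of the mean because a handful of chimeric or mis-paired reads with very large TLEN would otherwise pull the estimate up.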
Category: Bioinformatics | 5322 views | 0 comments
GeneSelector: Stability and aggregation of ranked gene lists
chuangma2006 2012-11-3 02:44
GeneSelector is an R package providing more than 10 statistical methods (e.g., fold change, limma, SAM) for identifying differentially expressed genes. More details can be found at: http://www.bioconductor.org/packages/release/bioc/manuals/GeneSelector/man/GeneSelector.pdf

References
Boulesteix AL, Slawski M. Stability and aggregation of ranked gene lists. Brief Bioinform. 2009;10(5):556-568.
Stiglic G, Bajgot M, Kokol P. Gene set enrichment meta-learning analysis: next-generation sequencing versus microarrays. BMC Bioinformatics. 2010;11:176.
Category: Tools | 3322 views | 0 comments


GMT+8, 2024-6-17 13:34

Powered by ScienceNet.cn

Copyright © 2007- China Science Daily (中国科学报社)
