ScienceNet (科学网)


Tag: Pipeline

Related forum posts

No related content

Related blog posts

Analyzing 16S data with the nf-core ampliseq (QIIME 2) pipeline
zd200572 2020-4-11 13:31
I recently saw a post from 生信技能树 introducing the pipeline-management tool nf-core and noticed that it ships an official QIIME 2 workflow, so I decided to learn it and explore the pitfalls along the way. That post already says plenty about nf-core itself; here I mainly go through setting it up and using it.

Part 1. Environment setup
First comes the environment setup, the required groundwork: without an environment nothing can be done. As I understand it, an nf-core environment can be prepared in three ways: local installation, conda, or docker. For beginners conda is generally the friendliest, except that some packages are missing from the Tsinghua mirror, downloads can be extremely slow and prone to failure, and preparing the environment may drag on for a long time. If the data set is not large, I suggest using a small cloud server physically located in Hong Kong or a similar region, which saves a great deal of installation time.

# Download conda; within mainland China the Tsinghua mirror is recommended:
# https://mirrors.tuna.tsinghua.edu.cn/anaconda/miniconda/Miniconda3-latest-Linux-x86_64.sh
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# Install following the prompts
bash Miniconda3-latest-Linux-x86_64.sh
# If you chose not to initialize, activate the base environment manually
source miniconda3/bin/activate
# Download the environment file required by the pipeline
wget https://github.com/nf-core/ampliseq/raw/master/environment.yml
# Create the environment required by the pipeline
conda env create -n ampliseq --file environment.yml
# Activate the environment
conda activate ampliseq
# Install nextflow
conda install -c bioconda nextflow -y

Part 2. Configuration and running
Configuration mainly follows the pipeline's parameter documentation on GitHub. The key settings are the 16S amplification primers, the machine's maximum CPU cores and RAM, and the truncation lengths for quality trimming, which are best determined by running FastQC first.

# Configuration
cd test
# Put the data in the working directory (omitted here)
# Prepare the sample-metadata.txt sample sheet, and download a pre-trained taxonomic classifier
# The classifier version must match; here it is 2019.10
wget https://data.qiime2.org/2019.10/common/gg-13-8-99-nb-classifier.qza
# Then run the pipeline; here I used a virtual machine with 2 cores and 4 GB of RAM.
# Since the pre-built environment is already activated, the -profile conda option is omitted,
# otherwise an identical environment would be created all over again.
nextflow run nf-core/ampliseq --reads Dong-16S \
  --FW_primer TACGGRAGGCAGCAG \
  --RV_primer AGGGTATCTAATCCT \
  --metadata sample-metadata.txt \
  --untilQ2import \
  --extension "/*R{1,2}.fastq" \
  --trunclenf 280 \
  --trunclenr 250 \
  --max_memory '3.GB' \
  --max_cpus 2 \
  --onlyDenoising

Then the pipeline produced its output. My impression is that a mature pipeline author, with rich data-processing experience, can push the hardware to its full potential and finish the largest workload in the shortest time, which is extremely important in production environments. Research settings usually do not face this problem; what research needs most are figures, conclusions that make the point, and a story. The pipeline schedules its tasks sensibly, steps can run interleaved, and there is essentially no rate-limiting step, which is well worth learning from and using.

Launching `nf-core/ampliseq` - revision: cd23988d88
process get_software_versions          0 of 1
process fastqc                         0 of 6
process trimming                       0 of 6
process multiqc                        -
process qiime_import                   -
process qiime_demux_visualize          -
......
process get_software_versions          1 of 1 ✔
process fastqc                         6 of 6 ✔
process trimming                       6 of 6 ✔
process multiqc                        1 of 1 ✔
process qiime_import                   1 of 1 ✔
process qiime_demux_visualize          1 of 1 ✔
process dada_trunc_parameter           1 of 1 ✔
process dada_single                    1 of 1 ✔
process classifier                     1 of 1 ✔
process filter_taxa                    1 of 1 ✔
process export_filtered_dada_output    1 of 1 ✔
process report_filter_stats            1 of 1 ✔
process RelativeAbundanceASV           1 of 1 ✔
process RelativeAbundanceReducedTaxa   1 of 1 ✔
process barplot                        1 of 1 ✔
process tree                           1 of 1 ✔
process alpha_rarefaction              1 of 1 ✔
process combinetable                   1 of 1 ✔
process diversity_core                 1 of 1 ✔
process metadata_category_all          1 of 1 ✔
process metadata_category_pairwise     1 of 1 ✔
process alpha_diversity                4 of 4, failed: 4 ✔
process beta_diversity                 -
process beta_diversity_ordination      4 of 4 ✔
process prepare_ancom                  1 of 1 ✔
process ancom_tax                      5 of 5 ✔
process ancom_asv                      1 of 1 ✔
process output_documentation           1 of 1 ✔
Pipeline completed successfully
NOTE: Process `alpha_diversity (evenness_vector)` terminated with an error exit status (1) -- Error is ignored
WARN: To render the execution DAG in the required format it is required to install Graphviz -- See http://www.graphviz.org for more info.

Completed at: 08-Apr-2020 06:44:42
Duration:     9m 24s
CPU hours:    0.2 (2.4% failed)
Succeeded:    43
Ignored:      4
Failed:       4

# The run took about 9 minutes, which is already super efficient; doing it by hand would probably take me a few hours. Because the data at hand had some quality issues, a few errors were reported during processing.
Part 3. Admiring the results
Let's see what the results look like. I use the word "admiring" because they really are beautifully done, basically the feel of a company's data-analysis report. I think that with a web front end on top, anyone could do cloud-based microbiome analysis. After all, 16S data analysis does not need a powerful computer; your own laptop can handle it. Focusing on the actual parameters rather than on every individual command: that is the future. Judging from the run, the author also uses a number of R scripts to draw many of the figures and to handle some of the file operations.

# Install tree to view the output directory structure
sudo apt install tree
tree
# Output:
├── Documentation
├── MultiQC
├── abundance_table
├── alpha-diversity
├── alpha-rarefaction
├── ancom
├── barplot
├── beta-diversity
├── demux
├── fastQC
├── phylogenetic_tree
├── pipeline_info
├── rel_abundance_tables
├── representative_sequences
├── taxonomy
└── trimmed

1. A help document is provided, which makes it easy to understand what each of the directories above contains.
2. Next is the results summary, an overview of the run: CPU and memory usage, run time, and the details of every task, including the script commands.
3. As for the results themselves, the pipeline unpacks QIIME 2's qzv files so they can be opened directly in a browser without going through the view.qiime2.cn site, and the files are renamed for easier browsing. The contents are the same as QIIME 2's own output, so I will not paste them here.
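If the pipeline ran on a remote server or VM (as it did here), one simple way to browse the unpacked HTML reports from your own machine is Python's built-in web server. This is my suggestion rather than part of the pipeline, and results/ stands for whatever output directory the run used:

cd results                  # the pipeline's output directory (name assumed here)
python3 -m http.server 8000
# then open http://<server-ip>:8000/ in a local browser and click through the reports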
Category: biology | 3367 reads | 0 comments
TASSEL pipeline commands for association analysis
Popularity 1 | wangbingcai2017 2017-4-29 16:32
Under Windows 7, TASSEL 5 can run multiple tasks in its GUI but cannot be automated there. TASSEL 5 also provides a command-line pipeline that supports batch processing and cuts down on repetitive work. An introduction to the pipeline is given in the attachment Tassel_Pipeline_Tutorial20160330.pdf. A typical MLM (mixed linear model) analysis pipeline command looks like this:

perl run_pipeline.pl -fork1 -h genotype.hmp -filterAlign -filterAlignMinFreq 0.05
(Note: import the genotype data and filter it)
-fork2 -r trait.txt
(Note: import the phenotype data)
-fork3 -r pop_structure.txt -excludeLastTrait
(Note: import the population structure data)
-fork4 -k kinship.txt -combine5 -input1 -input2 -input3 -intersect -combine6 -input5 -input4 -mlm -mlmVarCompEst P3D -mlmCompressionLevel None -export result
(Note: import the kinship matrix; combine the phenotype, genotype and population structure data; set the MLM parameters)

In the original tutorial, the different plugins are distinguished by alternating grey and white backgrounds. Detailed usage of the pipeline commands is given in the attachment Tassel5PipelineCLI.pdf. Building on this, a Windows batch loop runs the MLM analysis over 33 genotype files:

for /l %%i in (1,1,33) do perl run_pipeline.pl -fork1 -h ./hmp/%%i.hmp -filterAlign -filterAlignMinFreq 0.05 -fork2 -r trait.txt -fork3 -r pop_structure -excludeLastTrait -fork4 -k kinship -combine5 -input1 -input2 -input3 -intersect -combine6 -input5 -input4 -mlm -mlmVarCompEst P3D -mlmCompressionLevel None -export result%%i

Save the script with a .bat extension and place it in the TASSEL 5 installation directory; put the other data files in the installation directory as well. A bash equivalent for Linux/macOS is sketched below.
Tassel_Pipeline_Tutorial20160330.pdf
Tassel5PipelineCLI.pdf
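For anyone running TASSEL on Linux or macOS rather than Windows, the same 33-genotype batch can be expressed as a bash loop. This is only a sketch that reuses exactly the flags and file names from the batch script above:

for i in $(seq 1 33); do
  perl run_pipeline.pl \
    -fork1 -h ./hmp/${i}.hmp -filterAlign -filterAlignMinFreq 0.05 \
    -fork2 -r trait.txt \
    -fork3 -r pop_structure -excludeLastTrait \
    -fork4 -k kinship \
    -combine5 -input1 -input2 -input3 -intersect \
    -combine6 -input5 -input4 \
    -mlm -mlmVarCompEst P3D -mlmCompressionLevel None \
    -export result${i}
done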
Category: Work | 6017 reads | 1 comment
pipeline
zoubinbin100 2016-10-10 10:01
bioconda: http://bioconda-recipes-demo.readthedocs.io/en/docs/
Experiences with workflows for automating data-intensive bioinformatics
Question: Workflow management software for pipeline development in NGS:
https://www.biostars.org/p/115745/
https://www.biostars.org/p/91301/
omics_pipe: https://pythonhosted.org/omics_pipe/custom_pipelines.html
A review of bioinformatic pipeline frameworks
https://github.com/pditommaso/awesome-pipeline

Awesome Pipeline
A curated list of awesome pipeline toolkits inspired by Awesome Sysadmin.

Pipeline frameworks & libraries
ActionChain - A workflow system for simple linear success/failure workflows.
Airflow - Python-based workflow system created by AirBnb.
Anduril - Component-based workflow framework for scientific data analysis.
Antha - High-level language for biology.
Bds - Scripting language for data pipelines.
Bpipe - Tool for running and managing bioinformatics pipelines.
Briefly - Python meta-programming library for job flow control.
Cluster Flow - Command-line tool which uses common cluster managers to run bioinformatics pipelines.
Clusterjob - Automated reproducibility, and hassle-free submission of computational jobs to clusters.
Compss - Programming model for distributed infrastructures.
Conan2 - Light-weight workflow management application.
Cosmos - Python library for massively parallel workflows.
Cuneiform - Advanced functional workflow language and framework, implemented in Erlang.
Doit - Task management automation tool.
Dagobah - Simple DAG-based job scheduler in Python.
Drake - Robust DSL akin to Make, implemented in Clojure.
Flex - Language agnostic framework for building flexible data science pipelines (Python/Shell/Gnuplot).
Flowr - Robust and efficient workflows using a simple language agnostic approach (R package).
Gwf - Make-like utility for submitting workflows via qsub.
Hive - System for creating and running pipelines on a distributed compute resource.
Joblib - Set of tools to provide lightweight pipelining in Python.
Ketrew - Embedded DSL in the OCAML language alongside a client-server management application.
Kronos - Workflow assembler for cancer genome analytics and informatics.
Loom - Tool for running bioinformatics workflows locally or in the cloud.
Longbow - Job proxying tool for biomolecular simulations.
Luigi - Python module that helps you build complex pipelines of batch jobs.
Makeflow - Workflow engine for executing large complex workflows on clusters.
Mario - Scala library for defining data pipelines.
Mistral - Python based workflow engine by the Open Stack project.
Moa - Lightweight workflows in bioinformatics.
Nextflow - Flow-based computational toolkit for reproducible and scalable bioinformatics pipelines.
NiPype - Workflows and interfaces for neuroimaging packages.
OpenGE - Accelerated framework for manipulating and interpreting high-throughput sequencing data.
PipEngine - Ruby based launcher for complex biological pipelines.
Pinball - Python based workflow engine by Pinterest.
PyFlow - Lightweight parallel task engine.
PypeFlow - Lightweight workflow engine for data analysis scripting.
Pwrake - Parallel workflow extension for Rake.
Qdo - Lightweight high-throughput queuing system for workflows with many small tasks to perform.
Qsubsec - Simple tokenised template system for SGE.
Rabix - Python-based workflow toolkit based on the Common Workflow Language and Docker.
Remake - Make-like declarative workflows in R.
Rmake - Wrapper for the creation of Makefiles, enabling massive parallelization.
Rubra - Pipeline system for bioinformatics workflows.
Ruffus - Computation Pipeline library for Python.
Ruigi - Pipeline tool for R, inspired by Luigi.
Sake - Self-documenting build automation tool.
SciLuigi - Helper library for writing flexible scientific workflows in Luigi.
Scoop - Scalable Concurrent Operations in Python.
Snakemake - Tool for running and managing bioinformatics pipelines.
Spiff - Based on the Workflow Patterns initiative and implemented in Python.
Stpipe - File processing pipelines as a Python library.
Suro - Java-based distributed pipeline from Netflix.
Swift - Fast easy parallel scripting - on multicores, clusters, clouds and supercomputers.
Toil - Distributed pipeline workflow manager (mostly for genomics).
Yap - Extensible parallel framework, written in Python using OpenMPI libraries.
WorldMake - Easy Collaborative Reproducible Computing.

Workflow platforms
ActivePapers - Computational science made reproducible and publishable.
Apache Airavata - Framework for executing and managing computational workflows on distributed computing resources.
Arvados - A container based workflow platform.
Biokepler - Bioinformatics Scientific Workflow for Distributed Analysis of Large-Scale Biological Data.
Chipster - Open source platform for data analysis.
Cromwell - Workflow Management System geared towards scientific workflows from the Broad Institute.
Fireworks - Centralized workflow server for dynamic workflows of high-throughput computations.
Galaxy - Web-based platform for biomedical research.
Kepler - Kepler scientific workflow application from University of California.
NextflowWorkbench - Integrated development environment for Nextflow, Docker and Reusable Workflows.
OpenMOLE - Workflow Management System for exploration of models and parameter optimization.
Ophidia - Data-analytics platform with declarative workflows of distributed operations.
Pegasus - Workflow Management System.
Sushi - Supporting User for SHell script Integration.
Yabi - Online research environment for grid, HPC and cloud computing.
Taverna - Domain independent workflow system.
VisTrails - Scientific workflow and provenance management system.
Wings - Semantic workflow system utilizing Pegasus as execution system.

Workflow languages
Common Workflow Language
Cloudgene Workflow Language
OpenMOLE DSL
Workflow Definition Language
Yet Another Workflow Language

Workflow standardization initiatives
Workflow 4 Ever Initiative
Workflow 4 Ever workflow research object model
Workflow Patterns Initiative
Workflow Patterns Library
ResearchObject.org

Literate programming (aka interactive notebooks)
Beaker - Notebook-style development environment.
Binder - Turn a GitHub repo into a collection of interactive notebooks powered by Jupyter and Kubernetes.
IPython - A rich architecture for interactive computing.
Jupyter - Language-agnostic notebook literate programming environment.
Pathomx - Interactive data workflows built on Python.
R Notebooks - R Markdown notebook literate programming environment.
Wakari - Web-based Python Data Analysis.
Zeppelin - Web-based notebook that enables interactive data analytics.

Build automation tools
Bazel - Build software just as engineers do at Google.
DoIt - Highly generalized task-management and automation in Python.
Gradle - Unified cross platforms builds.
Scons - Python library focused on C/C++ builds.
Shake - Define robust build systems akin to GNU Make using Haskell.
Make - The GNU Make build system.

Other projects
HPC Grid Runner
noWorkflow - Supporting infrastructure to run scientific experiments without a scientific workflow management system, and still get things like provenance.
https://github.com/nextflow-io/awesome-nextflow
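Several of the bioinformatics-oriented tools above (for example Snakemake, Nextflow and Toil) are distributed through the bioconda channel linked at the top of this post. A minimal sketch for trying two of them out, assuming a working conda installation (the environment name is arbitrary):

# Create a test environment and install two of the listed workflow engines from bioconda
conda create -n wf-test -c conda-forge -c bioconda snakemake nextflow -y
conda activate wf-test
snakemake --version
nextflow -version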
Category: Bioinformatics | 9 reads | 0 comments
A few small tips
rasin 2015-12-10 15:48
The following are small problems I ran into while learning, listed here for reference:

1. In Pipeline Pilot, components such as the HTML Table Viewer can open web files in the default browser, but after upgrading to Windows 8 the component stopped working. One fix is to modify the registry value HKEY_CLASSES_ROOT\Wow6432Node\CLSID\{0002DF01-0000-0000-C000-000000000046}\LocalServer32. The component actually works by calling "internetexplorer.application $(HTML Filename)", so an alternative is to replace the component with a local copy and change the command in its underlying Run Program component to the command name of your default browser, including the path if necessary. In theory you could also create a soft link to the browser in the system32 directory (see an earlier post). The same approach should work for the Excel viewer and other viewers.

2. If you need to export the images from a PowerPoint presentation, save the ppt as pptx and open the pptx with an archive tool; you will find the images in the media directory (a command-line sketch follows below).
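Tip 2 works because a .pptx file is simply a ZIP archive with the embedded media stored under ppt/media/. On Linux or macOS the same extraction can be done from the command line; a minimal sketch, where slides.pptx is a placeholder file name:

# Extract only the embedded media (images, audio, video) from the presentation
unzip slides.pptx 'ppt/media/*' -d slides_media
ls slides_media/ppt/media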
2593 reads | 0 comments
Pipeline Network Coding for Multicast Streams
hongyanee 2011-10-23 09:17
The paper studies the performance of pipeline network coding for multicast stream distribution in scenarios with high packet loss. Previous network coding research focused on batch network coding, which must wait for a whole batch of packets to arrive before coding them together. The idea of pipeline network coding is to code on the fly as packets arrive. Its advantages: reduced coding delay; further improved throughput; transparency to the upper layers; no special hardware required; easy to deploy.
1. In wireless environments, MANETs are susceptible to channel errors, interference and congestion. Two methods are commonly used for error recovery: Forward Error Correction and ARQ. ARQ is unsuitable for real-time multicast streams, whereas erasure coding can correct errors by introducing coding redundancy. In network coding, several packets are likewise linearly coded into several linear combinations. The difference is that erasure coding performs the encoding at the source node, while network coding does it at intermediate nodes.
2. Researchers therefore introduced batch network coding, i.e., random linear coding of a batch of packets at the source and relay nodes. This scheme has two drawbacks: 1) it introduces encoding and decoding delay, which grows with the batch size; 2) the batch can only be decoded once enough linearly independent packet combinations have arrived.
3. To address the problems of batch coding, the paper proposes pipeline coding. Instead of waiting for the whole batch before coding, a new linear combination is generated each time a new packet arrives, with the coefficients of packets that have not yet arrived set to 0. Encoding and decoding can therefore proceed as each new packet arrives. The benefits: 1) very low encoding and decoding delay; 2) higher throughput; 3) transparency to the upper layers; 4) no special hardware required.
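A toy illustration of point 3 (my own, not taken from the paper): suppose a batch consists of packets p1, p2, p3, p4 and only p1 and p2 have arrived so far. A pipeline-coded packet can already be sent as

c = a1*p1 + a2*p2 + 0*p3 + 0*p4

with a1 and a2 drawn at random from the coding field. As p3 and p4 arrive, later combinations simply draw nonzero coefficients for them as well, so the sender never stalls waiting for the full batch, which is where the reduction in encoding delay comes from.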
Category: Notes | 5786 reads | 0 comments
Pipeline & SlideShare
freton 2010-9-2 21:36
Part I:
Pipelining, adopted both in industrial assembly lines and in processors, is used for performance enhancement or improvement. In addition, processors employ superscalar execution, multi-issue, reorder buffers, caches, etc., to improve performance.
Pipeline control logic:
bypass unit --- forwarding, data dependency
branch predict unit --- prediction, control dependency
Part II:
Two places for sharing:
SlideShare: as it states, it is the best way to share presentations, documents and professional videos. http://www.slideshare.net/
Zoho: http://viewer.zoho.com/Upload.jsp
After uploading what we want to share, we get the corresponding links; we can then copy the URLs into Google Translate and let the tool translate the resources into whatever language we are interested in.
Category: MyDailyScience | 2832 reads | 0 comments
