在数据处理或批量文件处理中,常有对多个小文件合并到一个大文件中的需求,以下是两段小程序,记录在此以备用。 这是我用 VB 写的程序,电脑上没有编程软件时可用 Excel ,在 Excel 的“宏”模块写下如下代码来执行,实现把某文件夹下的所有文本文件合并到一个 aaa3.txt 文件中,具体文件夹或输出格式可照程序修改: Sub Macro1() Dim fso, fd, fi, ffs Dim FileNumber, i, j As Integer Dim Mystring, strSortFile() As String FileCount = 0 FileNumber = FreeFile Set fso =CreateObject(Scripting.filesystemobject) Set fd =fso.getfolder(E:\\prg_python\\temp) Set ffs = fd.Files For Each fi In ffs FileCount = FileCount + 1 Next If FileCount = 0 Then GoTo ErrPrc Dim Aaa As String ReDim strSortFile(0 To FileCount) For Each fi In ffs Aaa = fi.Name If Aaa strSortFile(i) Then strSortFile(i + 1) = Aaa Else j = i Do strSortFile(j + 1) =strSortFile(j) j = j - 1 Loop Until Aaa = strSortFile(j) strSortFile(j + 1) = Aaa End If i = i + 1 Next j = 1 OpenE:\\prg_python\\temp\\aaa3.txt For Output As #5 For i = 1 To FileCount Open fd \\ strSortFile(i) For Input As FileNumber Print #5, __________________ Print #5, strSortFile(i) Print #5, ______________ While Not EOF(FileNumber) Line Input #FileNumber, Mystring Print #5, Mystring Wend Close #FileNumber j= j + 1 Next i Close #5 Exit Sub ErrPrc: MsgBox File Error Chr(10) Select CorrectFile Directory , , ERROR End Sub 以下是用 Perl 来实现同样的功能,把文件某文件夹下 .py 文件合并到 kang02.txt 中,程序相对就短一点了。 $aa=E:\\\\prg_python; chdir $aa; @FILES=`dir *.py/b`; my $fff; open(F_ALL,kang02.txt); for($j=0;$j@FILES;$j++){ $fff=$FILES ; chomp($fff); print F_ALL#============.$fff.=========.$j.====\\n; open(F_PL,$fff) or diecannot open\\n; while($line=F_PL){ print F_ALL $line; } close(F_PL); } close(F_ALL); print \\n;
科学网对Markdown排版支持较差,对格式不满意的用户请跳转至 CSDN 或 “生信宝典”公众号 阅读; 想了解更多宏基因组、16S分析相关文章,快关注“宏基因组”公众号,干货第一时间推送。 系统学习生物信息,快关注“生信宝典”,那里有几千志同道合的小伙伴一起学习。 前言 “工欲善其事必先利其器”,生信工程师每天写代码、搭流程,而且要使用至少三门编程语言,没有个好集成开发环境(IDE,Integrated Development Environment)那怎么行? 本人使用过vim, editplus, ultraedit, notepad++, sublime。感觉在多语言支持、直接远程编辑脚本、启动速度等方面还是editplus用着比较舒服,适合我的个人习惯。 Editplus 下载和安装 最好官网下载最新版4.3,喜欢的话正版才30$,关键是不注册也不影响使用。 https://www.editplus.com/download.html 有32/64位版,建议安装64位版 epp430_64bit.exe ,还有中文版(不建议,全是老版本),英语拼写检查(安装了没看到效果); 先安装完成后,打开,会出现配置设置、语法文件位置选择,如下图 建议修改到自己的目录,方便管理和备份,如改为C:\Users\woodc\Desktop\home\soft\editplus 如果不想看到试用字样,百度可以找到很多注册机/注册码,很容易激活。 添加Perl语言模板 该程序对Perl语法默认支持已经非常好了,只是缺少个生信专用模板,参考我的上篇文章 生信人写程序1. Perl语言模板及配置 右键另存下载perl模板文件 直接单击可能会报错,因为Perl的pl文件是也属于网页的一种,会被运行,而内容又不是网页,所以报错。 主要操作如下:将《Perl语言模板》原文中代码复制到editplus中新建的空白文件,点保存; 第一种情况:如果刚才设置了新的模板目录,请选择你自己设置的目录,替换template.pl。 第二种情况:没有更改配件文件目录,默认的保存位置可替换template.pl即可。 如果下次使用新建Perl不能自动加载模板,可以尝试将模板代码保存为template.pl在任何位置,选择Tools - Preference - template — Perl,更改template.pl文件位置为刚才保存的模板template.pl文件即可。 以后点新建- perl会自己加载我们配置的模板开使写新程序;其实我们更多是找写过相近的程序再修改,这个过程是逐渐积累的,领域和用途不同,自己的常用功能也是很个性化的。 添加Shell语言支持 https://www.editplus.com/others.html 选择* Shell stx - 肖俊斌 (2011-06-21)下载,解压后有shell.stx语法文件放在之前设置的目录;也可 直接右键点我下载shell语法 再选择 Tools — Preference — Setting syntax, Add - 输入 “Shell” — OK, 文件扩展添”sh”,语法文件选择下载的shell.stx;点OK; Shell写作模板 主要包括命令行参数解析、默认参数设置、程序功能描述及帮助文档等 右键另存下载Shell模板文件 #!/bin/bash set -e # 设置程序参数的缺省值,少用参数即可运行 # Default parameter input=input.txt output=output.txt database=database.txt execute='TRUE' # 程序功能描述,每次必改程序名、版本、时间等;版本更新要记录清楚,一般可用-h/-?来显示这部分 # Function for script description and usage usage() { cat EOF 2 Usage: ------------------------------------------------------------------------------- Filename: template.sh Revision: 1.0 Date: 2017/6/24 Author: Yong-Xin Liu Email: yxliu@genetics.ac.cn Website: http://bailab.genetics.ac.cn/ Description: This script is solve parameter read and default Notes: Function of this script ------------------------------------------------------------------------------- Copyright: 2017 (c) Yong-Xin Liu License: GPL This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. If any changes are made to this script, please mail me a copy of the changes ------------------------------------------------------------------------------- Version 1.0 2017/6/24 # 输入输出文件格式和示例,非常有用,不知道格式怎么准备文件呀 # Input files: input.txt, can inclue many file # 1. input.txt, design of expriment SampleIDBarcodeSequencegroupgenebatchdescription WT.1TAGCTTWTggps9.102double mutant of ggps9-ggps10, cause A/B down WT.2GGCTACWTggps9.102double mutant of ggps9-ggps10, cause A/B down WT.3CGCGCGWTggps9.102double mutant of ggps9-ggps10, cause A/B down # 2. database.txt, annotation of gene IDdescription AT3G48300Transcript factor # Output file 1. Annotated samples DE genes SamplesIDdescription WtAT3G48300Transcript factor 2. Volcano plot: vol_otu_SampleAvsSampleB.pdf # 参数描述,写清功能的缺省值 OPTIONS: -d database file, default database.txt -i input file, recommend must give -o output file or output directory, default output.txt -h/? show help of script Example: template.sh -i input.txt -d database.txt -o result.txt EOF } # 解释命令行参数,是不是很面熟,其实是调用了perl语言的getopts包, # Analysis parameter while getopts d:h:i:o: OPTION do case $OPTION in d) database=$OPTARG ;; h) usage exit 1 ;; i) input=$OPTARG ;; o) output=$OPTARG ;; ?) usage exit 1 ;; esac done 将以上代码保存为template.sh,点击Tools — Preference — Template — Add 命名为Shell,选择template.sh文件,OK。 以后点New file, 选择shell即自动加载模板; R语言的语法支持 官网下载* R programming language stx - Wei Wang (2007-05-15),或 点我下载R语法文件 Tools — Preference — Setting syntax, Add - 输入 “R” — OK, 文件扩展添”r,R,Rmd”,语法文件选择下载r的stx;点OK; 现在打开个R文件试试,已经语法高亮了 如果有Rstudio server的小伙伴,建议直接用网页版Rstudio在服务器上调式; 远程编辑脚本 先添加远程打开和保存工具栏按钮 Tools - Preference - Tools bar ; 把左侧的Open Remote, Save as Remote 选中按右箭头添加右侧;再选择edit菜单的Line Comment, Line Uncoment添加右侧;把右侧的远程打开和保存托至顶部,常用在前面好找;点OK确定配置 连接服务器打开文件编辑 点Open Remote按扭,点Setting设置远程服务器信息,添加服务器名称、IP、账号和密码,再点Advance中选择Encryption为sftp,OK再OK;即可正常连接服务器并浏览文件,我们选择编码Encoding为UTF-8,再打开shell脚本; 编辑吧,保存自动为远程保存,可以随时保存后马上运行调试,非常方便; 下次再打开已经使用过的文件,记得文件-最近打开文件选择更方便。
# Check the Perl version installed In the terminal, type in: $ perl -v This is perl 5, version 24, subversion 0 (v5.24.0) built for darwin-thread-multi-2level Copyright 1987-2016, Larry Wall Perl may be copied only under the terms of either the Artistic License or the GNU General Public License, which may be found in the Perl 5 source kit. Complete documentation for Perl, including FAQ lists, should be found on this system using man perl or perldoc perl. If you have access to the Internet, point your browser at http://www.perl.org/, the Perl Home Page. # Summary of Perl configuration $ perl -V Summary of my perl5 (revision 5 version 24 subversion 0) configuration: Platform: ... Compiler: ... Linker and Libraries: ... Dynamic Linking: ... Characteristics of this binary (from libperl): Compile-time options: HAS_TIMES MULTIPLICITY PERLIO_LAYERS PERL_COPY_ON_WRITE PERL_DONT_CREATE_GVSV PERL_HASH_FUNC_ONE_AT_A_TIME_HARD PERL_IMPLICIT_CONTEXT PERL_MALLOC_WRAP PERL_PRESERVE_IVUV PERL_USE_SAFE_PUTENV USE_64_BIT_ALL USE_64_BIT_INT USE_ITHREADS USE_LARGE_FILES USE_LOCALE USE_LOCALE_COLLATE USE_LOCALE_CTYPE USE_LOCALE_NUMERIC USE_LOCALE_TIME USE_PERLIO USE_PERL_ATOF USE_REENTRANT_API Built under darwin Compiled at May 11 2016 09:17:03 @INC: /opt/local/lib/perl5/site_perl/5.24/darwin-thread-multi-2level /opt/local/lib/perl5/site_perl/5.24 /opt/local/lib/perl5/vendor_perl/5.24/darwin-thread-multi-2level /opt/local/lib/perl5/vendor_perl/5.24 /opt/local/lib/perl5/5.24/darwin-thread-multi-2level /opt/local/lib/perl5/5.24
需要解决的问题是 求 # y=x*sin(10*pi*x)+2 # x-seq(-1,2,0.01) 在x的范围内y的最大值对应的x,精确到4位小数 属于单基因的遗传算法,交叉不能带来新的基因 ################################################################# 生物和计算的交叉,给计算引入了遗传算法 染色体 基因 突变 交叉(重组) 选择 计算和生物的交叉,带来了计算机辅助药物设计 #################### #!/usr/bin/perl -w #use strict; # x-seq(-1,2,0.01) use AI::Genetic; my $ga = new AI::Genetic( -fitness = \fitnessFunc, -type = 'rangevector', -population = 500, -crossover = 0.9, #一个基因的话 没有交叉的必要0.9 -mutation = 0.1, -terminate = \terminateFunc, ); my @gene1; $ga-init( ]);#基因的数目 $ga-evolve('rouletteTwoPoint', 100); # 选择的方法 一共进化的代数 my $individual=$ga-getFittest(1); my $resultx=$individual-genes(); $resultx=@$resultx /10000; print Best score = , $individual-score(), correspond genes: , $resultx, \n; sub fitnessFunc { my $genes = shift; my $fitness; # assign a number to $fitness based on the @$genes # ... # y=x*sin(10*pi*x)+2 my $genes1=@$genes /10000; $fitness=$genes1*sin(10*3.14159*$genes1)+2; # print $genes1 $fitness\n; return $fitness; } sub terminateFunc { my $ga = shift; # terminate if reached some threshold. my $THRESHOLD=4; return 1 if $ga-getFittest-score $THRESHOLD; return 0; } ####################### 算的答案是Best score = 3.8492704037941 correspond genes: 1.8495 Best score = 3.84999793705916 correspond genes: 1.8511 ###################### ########需要修改pm中的optMutation.pm文件############################### ##### for my $g (@$genes)修改为 for my $g (@$genes-1)################################## package AI::Genetic::OpMutation; use strict; 1; # This package implements various mutation # algorithms. To be used as static functions. # sub bitVector(): # each gene is a bit: 0 or 1. arguments are mutation # prob. and anon list of genes. # returns anon list of mutated genes. sub bitVector { my ($prob, $genes) = @_; for my $g (@$genes) { next if rand $prob; $g = $g ? 0 : 1; } return $genes; } # sub rangeVector(): # each gene is a floating number, and can be anything # within a range of two numbers. # arguments are mutation prob., anon list of genes, # and anon list of ranges. Each element in $ranges is # an anon list of two numbers, min and max value of # the corresponding gene. sub rangeVector { my ($prob, $ranges, $genes) = @_; my $n= scalar(@$genes); #print end: $n \n; my $i = -1; for my $g (@$genes-1) { $i++; next if rand $prob; # now randomly choose another value from the range. my $abs = $ranges- - $ranges- + 1; $g = $ranges- + int rand($abs); } return $genes; } # sub listVector(): # each gene is a string, and can be anything # from a list of possible values supplied by user. # arguments are mutation prob., anon list of genes, # and anon list of value lists. Each element in $lists # is an anon list of the possible values of # the corresponding gene. sub listVector { my ($prob, $lists, $genes) = @_; my $i = -1; for my $g (@$genes) { $i++; next if rand $prob; # now randomly choose another value from the lists. my $new; if (@{$lists- } == 1) { $new = $lists- ; } else { do { $new = $lists- }]; } while $new eq $g; } $g = $new; } return $genes; } __END__ =head1 NAME AI::Genetic::OpMutation - A class that implements various mutation operators. =head1 SYNOPSIS See LAI::Genetic. =head1 DESCRIPTION This package implements a few mutation mechanisms that can be used in user-defined strategies. The methods in this class are to be called as static class methods, rather than instance methods, which means you must call them as such: AI::Genetic::OpCrossover::MethodName(arguments) =head1 CLASS METHODS There is really one kind of mutation operator implemented in this class, but it implemented for the three default individuals types. Each gene of an individual is looked at separately to decide whether it will be mutated or not. Mutation is decided based upon the mutation rate (or probability). If a mutation is to happen, then the value of the gene is switched to some other possible value. For the case of Ibitvectors, an ON gene switches to an OFF gene. For the case of Ilistvectors, a gene's value is replaced by another one from the possible list of values. For the case of Irangevectors, a gene's value is replaced by another one from the possible range of integers. Thus, there are only three methods: =over =item BbitVector(Imut_prob, genes) The method takes as input the mutation rate, and an anonymous list of genes of a bitvector individual. The return value is an anonymous list of mutated genes. Note that it is possible that no mutation will occur, and thus the returned genes are identical to the given ones. =item BlistVector(Imut_prob, genes, possibleValues) The method takes as input the mutation rate, an anonymous list of genes of a listvector individual, and a list of lists which describe the possible values for each gene. The return value is an anonymous list of mutated genes. Note that it is possible that no mutation will occur, and thus the returned genes are identical to the given ones. =item BrangeVector(Imut_prob, genes, rangeValues) The method takes as input the mutation rate, an anonymous list of genes of a rangevector individual, and a list of lists which describe the range of possible values for each gene. The return value is an anonymous list of mutated genes. Note that it is possible that no mutation will occur, and thus the returned genes are identical to the given ones. =back =head1 AUTHOR Written by Ala Qumsieh Iaqumsieh@cpan.org. =head1 COPYRIGHTS (c) 2003,2004 Ala Qumsieh. All rights reserved. This module is distributed under the same terms as Perl itself. =cut genetic.pl OpMutation.pm
想MS用perl编程,先熟悉perl的语法,没必要深究,看 http://perldoc.perl.org/perlintro.html 就够了,主要熟悉变量、operator,file and I/O。 然后就是到MS看功能函数或模块,这才是重点。 1. 变量定义问题 Perl默认所有变量为包变量(Package variables),包变量为全 局变量,这意味着程序的任何其他部分,甚至在其他文件里定义的子程序,都能影响和修改变量的值。在一定程度上讲,这样是“不安全”的。 my变量: 自Perl 5以后,增加了新的非全局变量,也称为词法变量、私有变量、局部变量,或称为 my变量 。注意,不是local变量,local变量有另外的含义。例如 my $a; 就定义了一个局部变量$a, 它的作用域是当前块(Block),通俗地讲就是大括号里面,如果没有大括号就是从定义的地方开始到程序结束 。而local变量的意思,是 local变量 :local和my明显的不同,my创建局部变量,而local作用于包变量。具体地讲, local $x 实际做的事是: 存储包变量$x的当前值在一个安全的地方,然后用一个新值替换它,假如没有指定新值,就使用undef代 替。当控制离开当前块时,它也会恢复$x的旧值。 它影响的是包变量,这个包变量获取了本地值。但包变量总是全局的,local申明的包变量亦无例外。为了 显示其区别,请看这个: $lo = 'global'; $m = 'global'; A(); sub A { local $lo = 'AAA'; my $m = 'AAA'; B(); } sub B { print B , ($lo eq 'AAA' ? 'can' : 'cannot') , see the value of lo set by A.\n; print B , ($m eq 'AAA' ? 'can' : 'cannot') , see the value of m set by A.\n; } 结果会打印: B can see the value of lo set by A. B cannot see the value of m set by A. 发生了什么?在A函数里的local申明,给包变量$lo赋予了一个新的临时值AAA。旧值global会被存储起来,直到A返回,但在这点之前,A调用了B。B访问$lo的内容没有问题,因为$lo是包变量,包变量总是全局可见的,所以它能见到A设置的AAA值。 2. 自定义函数 用户函数又称子程序(Subroutine),在Perl中用下面的结构来定义用户函数: sub 子程序名{ 语句块; } 这里的子程序名与变量的取名规则类似。 以显示欢迎词的程序为例: sub say_hello{ print 你好,欢迎光临网上学园; } 用户函数的定义可以位于程序的任何位置,比如说放在文件的未尾。如果两个子程序使用了相同的程序名, 后面的子程序将覆盖前面子程序。 用户函数中的变量默认为全局变量,与其他程序共享。 用户函数的调用:通过在子程序前加“”调用,可在任一表达式内调用。 子程序中可以再调用另外的子程序。 调用用户函数产生的结果称为返回值(return value)。返回值是每次调用函数中最后一个表达式的计算值。 以加法函数为例: sub add_a_b{ $a+$b; } 函数最后一条表达式为$a+$b,故返回值为$a+$b。以下是调用情况: $a=5; $b=6; $c=add_a_b; #$c的值为11 $d=5*add_a_b; #$d的值为5*11即55 上述的函数功能与传统直接写在程序中没什么两样,如果加上参数传递就可以实现全新的功能了。 在Perl中,如果函数调用后面跟着一个用括号括起来的列表,则在函数调用期间该列表将被自动分配给以@_命名的特殊变量。 函数可以访问该变量,从而确定参数的个数及赋值。 仍以加法函数为例: sub add_a_b{ $_〔0〕+$_〔1〕; } $c=add_a_b(5,6); #$c的值为11 $d=5*add_a_b(2,3); #d的值为5*5即25 如何改变参数的个数呢?我们可以用循环的方式来实现: sub add_all{ $sum=0; #将sum初始化 foreach $_(@_) { #遍历参数列表 $sum+=$_; #累加每个元素 } $sum; #返回sum即总和的值 } $a=add_all(3,4,5); #$a的值为3+4+5即12 $d=2*add_all(1,2,3,4,5); #d的值为2*15即30 既然函数中的变量全为全程变量,那么上述程序中若调用程序中含有$sum变量时将替换,这不是我们所要的。 那么如何解决这一问题呢? 答案就是:使用局部变量, 使用local()操作符就可实现此功能。在上面的程序中,只需在第一行$sum=0;前加入: local($sum); 当函数执行时,$sum的全程变量的值被保留起来,同时建立一个局部变量$sum。 用perl进行截断能收敛性测试(Convergence test)的例子,以供大家参考。 =================================================================== #!perl use strict; use MaterialsScript qw(:all); # 定义计算文档和结果输出文档 my $myDoc=$Documents{WO3.xsd};#要计算的模型文件 my $myStudyTable=Documents-new(energy-encut.std);#新建StudyTable存放计算结果,第一列是截断能,第二列是体系总能 my $mySheet=$myStudyTable-ActiveSheet; $mySheet-ColumnHeading(0)=Enegy Cutoff(eV); $mySheet-ColumnHeading(1)=Final Enegy(eV); # castep single calculation my $castep=Modules-CASTEP; my $startEnergy=200;#测试起点:200 eV my $endEnergy=500;#测试终点: 500 eV my $intervalEnergy=20;#测试点间隔 my $sumIteration=($endEnergy-$startEnergy)/$intervalEnergy; #循环计算每个截断能测试点的单点能 for(my $counter=0;$counter=$sumIteration;++$counter){ my $energyCutoff=$startEnergy+$intervalEnergy*$counter; $castep-ChangeSettings( Settings(Quality=Fine, UseCustomEnergyCutoff=Yes, EnergyCutoff=$energyCutoff));#这里设置截断能为测试点的截断能 $castep-Energy-Run($myDoc); $mySheet-Cell($counter,0)=$energyCutoff; #read final energy from castep output files #下面这一段提取castep文件中的总能,即Final energy后面的结果 foreach my $line (@{$Documents{WO3.castep}-Lines}) { if ($line=~/^Final energy/){ my $finalEnergy = substr($line,31,15); print $energyCutoff,\t,$finalEnergy,\n; $mySheet-Cell($counter,1)=$finalEnergy;#将总能结果放入StudyTable }} }
1. 让我们开始吧 为了让第一次使用本手册的同志们在刚开始就有成功的喜悦,这里给出一个例子,大家准备好自己手中的 fasta 文件吧! 请在文本编辑器中写入如下程序,并在终端运行: #! /usr/bin/perl –w use strict; use Bio::SeqIO; my $file = “*********”; # Please use your path to replace the starts my $seqio_object = Bio::SeqIO - new(-file = $file); my $seq_obj = $seqio_object - next_seq; printf “$seq_obj\n”; 如果你成功键入了以上程序并且没有报错发生,那么屏幕上面就会正常显示出你的 fasta 序列。那么恭喜你,你已经成功调用了 BioPerl 的模块,并且完成了一个面对对象的程序。下面我们就来看一下我们第一个认识的 BioPrel 的模块 Bio::SeqIO 。 2. 关于 Bio::SeqIO 的那些事儿 在介绍 Bio::SeqIO 之前,先来说一下为什么会产生 BioPerl 这个东西。在生物信息学起步之初 Prel 语言强悍的字符串处理能力以及执行效率,毫无疑问的被各位从计算机和数学行业转行过来的“生物学家”选为工具语言(在生物信息数据处理方面放眼望去毫无疑问是 Perl 语言的天下,近来对大规模数据的处理方面 R 语言亦有崛起之势)。但是,对于这海量的数据,同样丰富多彩的数据格式以及花样繁多的数据分析;每次处理数据都要重新自己编写正则表达式未免效率过于低下。于是,在 Perl 一次重大的更新之后(引入面对对象编程,后面都将使用 OOP 代替面对对象编程),几个不太勤快的学生物的程序员看到了通用编程的可能,于是就有了我们现在广泛应用的 BioPerl 。 那我们就来说说这个 Bio::SeqIO 以及它的姊妹模块 Bio::Seq 。我们为什么要使用 Bio::SeqIO 和 Bio::Seq 模块呢?其原因非常简单,就是因为这两个模块其实就是一个非常非常智能的文件句柄。 Bio::SeqIO 可以根据你的输入文件类型抽取出所需要的信息,而 Bio::Seq 则可以按照格式要求储存数据信息。就拿 GenBank 的 flat file 文件来讲,其中的 feature 等信息都是分门别类的进行储存。在这里说也不容易理解,下面我们直接上程序来说明。 3. Bio::SeqIO 支持的文件格式 Bio::SeqIO 几乎涵盖所有常见的生物学数据库的通用文档格式,并且可以很好的对格式进行转换,有如此方便的功能,全仰仗于 Perl 语言本身所具有的强悍的字符串处理能力。 下表所展示的就是截止于 1.6 版本 Bio::SeqIO 所支持的格式: Name Description File extension Module abi ABI tracefile ab Bio::SeqIO::abi ace Ace database ace Bio::SeqIO::ace agave AGAVE XML Bio::SeqIO::agave alf ALF tracefile alf Bio::SeqIO::alf asciitree write-only, to visualize features Bio::SeqIO::asciitree bsml BSML , using XML::DOM bsml Bio::SeqIO::bsml bsml_sax BSML , using XML::SAX Bio::SeqIO::bsml_sax chadoxml CHADO sequence format Bio::SeqIO::chadoxml chaos CHAOS sequence format Bio::SeqIO::chaos chaosxml Chaos XML Bio::SeqIO::chaosxml ctf CTF tracefile ctf Bio::SeqIO::ctf embl EMBL database ebl|emb|dat Bio::SeqIO::embl entrezgene Entrez Gene ASN1 Bio::SeqIO::entrezgene excel Excel Bio::SeqIO::excel exp Staden EXP format exp Bio::SeqIO::exp fasta FASTA fast|seq|fa|fsa|nt|aa Bio::SeqIO::fasta fastq quality score data in FASTA -like format fastq Bio::SeqIO::fastq flybase_chadoxml variant of Chado XML Bio::SeqIO::flybase_chadoxml game GAME XML Bio::SeqIO::game gcg GCG gcg Bio::SeqIO::gcg genbank GenBank gbank|genbank Bio::SeqIO::genbank interpro InterProScan XML Bio::SeqIO::interpro kegg KEGG Bio::SeqIO::kegg largefasta Large files, fasta format Bio::SeqIO::largefasta lasergene Lasergene format Bio::SeqIO::lasergene locuslink LocusLink LL_tmpl Bio::SeqIO::locuslink metafasta Bio::SeqIO::metafasta phd Phred phred Bio::SeqIO::phd pir PIR database pir Bio::SeqIO::pir pln PLN tracefile pln Bio::SeqIO::pln qual Phred Bio::SeqIO::qual raw plain text txt Bio::SeqIO::raw scf Standard Chromatogram Format scf Bio::SeqIO::scf seqxml SeqXML sequence format using XML::LibXML and XML::Writer xml Bio::SeqIO::seqxml strider DNA Strider format Bio::SeqIO::strider swiss SwissProt sp Bio::SeqIO::swiss tab tab-delimited Bio::SeqIO::tab table Table Bio::SeqIO::table tigr TIGR XML Bio::SeqIO::tigr tigrxml TIGR Coordset XML Bio::SeqIO::tigrxml tinyseq NCBI TinySeq XML Bio::SeqIO::tinyseq ztr ZTR tracefile ztr Bio::SeqIO::ztr 注意: bioperl-ext 以及 io_lib 库文件对于支持 scf, abi, alf, pln, exp, ctf, ztr 格式是不可或缺的。 下面我们就用例子来说话,来看一下 Bio::SeqIO 是如何将一个复杂的文件储存在对象( OOP 的概念)中方便处理的。 4. 方法大全 4.1. next_seq 首先来看代码如下所示: # print out accession numbers of all entries in a GenBank flat file. # first, bring in the SeqIO module use Bio :: SeqIO ; # Notice that you do not have to use any Bio:SeqI # objects, because SeqIO does this for you. In fact, it # even knows which SeqI object to use for the provided # format. # Bring in the file and format, or die with a nice # usage statement if one or both arguments are missing. my $usage = "getaccs.pl file format \n " ; my $file = shift or die $usage ; my $format = shift or die $usage ; # Now create a new SeqIO object to bring in the input # file. The new method takes arguments in the format # key = value, key = value. The basic keys that it # can accept values for are '-file' which expects some # information on how to access your data, and '-format' # which expects one of the Bioperl-format-labels mentioned # above. Although it is optional, it is good # programming practice to provide and in front of any # filenames provided in the -file parameter. This makes the # resulting filehandle created by SeqIO explicitly read () # or write(). It will definitely help others reading your # code understand the function of the SeqIO object. my $inseq = Bio :: SeqIO - new ( - file = "$file" , - format = $format , ) ; # Now that we have a seq stream, # we need to tell it to give us a $seq. # We do this using the 'next_seq' method of SeqIO. while ( my $seq = $inseq - next_seq ) { print $seq - accession_number , " \n " ; } 不知各位是否看出来以上代码的功能了吗?没错,其功能就是读取一个输入的一定格式文件的 Accession Number 。没有自己设计的正则表达式,不需要自行找出文件的规律,更不用每次都重新写一个处理文本的程序。需要做的就是关注你所需要的重点信息, SeqIO 模块的 next_seq 方法来搞定其他的东西。 至于程序中各变量的含义,待我细细分解,就是一些 OOP 的概念,多加揣摩就能看懂以后的关于 OOP 的程序啦。 首先来看 $inseq ,这个变量就可以理解为一个 object/instance , new 方法或函数建立的一个模块 SeqIO 的对象,模块中的方法都可以用于对这个对象进行处理。而 next_seq 就这之中的诸多方法之一。基本的语法结构就是“对象 – 方法( instance – method )”的形式。 next_seq 完成的动作就是取出文件中的一个 entry 的全部内容,然后根据需求找出你想要的数据就行了,比如 Accession Number 。 4.2. write_seq 看下面一个将一个格式的文件转为另外一个格式的程序: use Bio :: SeqIO ; # get command-line arguments, or die with a usage statement my $usage = "x2y.pl infile infileformat outfile outfileformat \n " ; my $infile = shift or die $usage ; my $infileformat = shift or die $usage ; my $outfile = shift or die $usage ; my $outfileformat = shift or die $usage ; # create one SeqIO object to read in,and another to write out my $seq_in = Bio :: SeqIO - new ( - file = "$infile" , - format = $infileformat , ) ; my $seq_out = Bio :: SeqIO - new ( - file = "$outfile" , - format = $outfileformat , ) ; # write each entry in the input file to the output file while ( my $inseq = $seq_in - next_seq ) { $seq_out - write_seq ( $inseq ) ; } 通过上一个例子,现在就可以知道了, write_seq 是对象 $seq_out 的一个方法,这个方法有一个变量,那就是 $inseq , $inseq 则是通过对象 $seq_in 的方法 next_seq 获得的。有点乱哈,不过仔细揣摩一下,对 OOP 加深一下认识。这里就不用自己设计需要输出什么了。因为,要输出的信息已经通过 next_seq 方法获得了。 In the end 至此 Bio::SeqIO 模块的两个主要方法已经介绍完了,其方法内各种参数的意义,请见本篇附表。下一弹, Bio::Seq 模块以及 translation 的应用,敬请期待!
今天在读《Shell 脚本学习指南》 时得到启发,很有兴趣写一个词频统计的软件。因此就花了几个小时用Perl语言写了一个100多行代码的软件。word_freq以自由软件 、开放源代码 的形式发布在此。文后附有源代码。 一、运行环境 1 perl 软件是由 Perl 写成,因此运行软件前,电脑上必须有 perl 解释器 , 可以在这里下载http://www.perl.org/get.html#win32 2 命令行 必须在命令行用户界面(Command User Interface) 下运行,因为软件是从标准输入(STDIN)读入文本流, 而将结果打印到标准输出(STDOUT), 可以很方便地做I/O重定向,以及组合管道。 二、输入输出 1 输入 输入为纯文本,未考虑支持中文。软件从标准输入读入数据,可以使用I/O重定向符号 ‘’ 或管道输入数据,也可以读取用户键入的内容。例如,cat file | word_freq, 或者 word_freq file,或者word_freq, 然后键入英文单词,Ctrl-D 结束。 2 输出 结果的输出如: Rank Word Freq. Sum 1 the 394 394 2 of 322 716 3 and 156 872 4 to 146 1018 5 in 123 1141 6 genome 98 1239 7 B 95 1334 8 a 84 1418 9 for 72 1490 10 were 69 1559 Total words in text: 7063 第一列为排序,第二列为单词,第三列为次数,第四列为累加,最后一行为总词数。 3 参数 -h 打印帮助页 -c 统计字符,而不是单词 -m NUM 打印单词出现次数不少于NUM的单词 -M NUM 打印单词出现次数不多于NUM的单词 -w NUM 打印单词长度不少于NUM的单词 -W NUM 打印单词长度不大于NUM的单词 -i 不区分大小写 以上参数可以组合使用 三、用途 1 文本分析 用于分析文章的词频。 2 辅助阅读英文论文 我使用了一篇英文论文做测试, 不区分大小写,统计获得1577个单词。看来只要掌握不超过2000个单词,就可以读懂一篇科学论文。 3计算DNA序列的GC含量。 参考资料: http://book.douban.com/subject/3519360/ http://www.gnu.org/gnu/the-gnu-project.html http://zh.wikipedia.org/wiki/Perl http://zh.wikipedia.org/wiki/命令行界面 源代码: #!/usr/bin/perl parse_commands(); if($help){help();} # # Parse input text # unless(@input_files){ while(STDIN){ if(\$character){@txt = $_ =~ /./g;} else{@txt = $_ =~ / +/g;} foreach(@txt){ if(\$ignore_case){\$_ = "\L$_\E";} \$word{$_}++; } $total += @txt; } }else{ foreach(@input_files){ open FILE,\$_ or die "Can't open file \$_: $!\n"; while(FILE){ if(\$character){@txt = \$_ =~ /./g;} else{@txt = \$_ =~ / +/g;} foreach(@txt){ if(\$ignore_case){\$_ = "\L$_\E";} \$word{$_}++; } $total += @txt; } close FILE; } } # # Print title # print "Rank\t"; if($character){ print "Char.\t"; }else{ print "Word\t"; } print "Freq.\t"; print "Sum\n"; # # Print frequency # foreach(sort{\$word{\$b} = \$word{$a}}(keys %word)){ if(\$min_freq \$word{\$_} $min_freq){next;} if(\$max_freq \$word{\$_} $max_freq){next;} if(\$min_length length(\$_) $min_length){next;} if(\$max_length length(\$_) $max_length){next;} $count++; \$sum += \$word{$_}; print "$count\t"; print "$_\t"; print "\$word{$_}\t"; print "$sum\n"; } print "Total ",(\$character?"characters":"words")," in text: $total\n"; # # Subroutines # sub parse_commands{ while(@ARGV){ $_ = shift @ARGV; if(-e \$_){push @input_files,$_;} elsif(\$_ eq '-h'){$help = 1;} elsif(\$_ eq '-c'){$character = 1;} elsif(\$_ eq '-m'){$min_freq = shift @ARGV;} elsif(\$_ eq '-M'){$max_freq = shift @ARGV;} elsif(\$_ eq '-w'){$min_length= shift @ARGV;} elsif(\$_ eq '-W'){$max_length = shift @ARGV;} elsif(\$_ eq '-i'){$ignore_case = 1;} else{ print STDERR "Unrecognized flag: $_\n"; print STDERR "$0 -h for help\n"; exit; } } } sub help{ system("clear"); print "WORD_FREQ(1) Word Frequency Analysis WORD_FREQ(1) NAME word_freq - word frequency analysis SYNOPSIS word_freq ... ... DESCRIPTION Count words of text from FILE(s), or standard input, and print the frequency of each word or character. OPTIONS -c Print frequency of characters -m NUM Print words with minimum frequency NUM -M NUM Print words with maximum freqeuncy NUM -w NUM Print words with minimum length NUM -W NUM Print words with maximum length NUM -i Ignore case -h Display this help and exit With no FILE, or when FILE is -, read standard input. AUTHOR Written by Leiting Li lileiting\@foxmail.com COPYRIGHT Copyright (c) 2012 Leiting Li. Licnese GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html. This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extend permitted by law. LEITING LI Febrary 2012 WORD_FREQ(1) "; exit; } (Leiting Li, Feb 26, 2012)
想学习好perl的,perldoc是一个不错的地方~,看看人家解释split多么清楚。以前很多人推荐,还不是很信,去过几次都浮光掠影没发现精华,今天算是领教了。 split /PATTERN/,EXPR,LIMIT split /PATTERN/,EXPR split /PATTERN/ split Splits the string EXPR into a list of strings and returns that list. By default, empty leading fields are preserved, and empty trailing ones are deleted. (If all fields are empty, they are considered to be trailing.) In scalar context, returns the number of fields found. If EXPR is omitted, splits the $_ string. If PATTERN is also omitted, splits on whitespace (after skipping any leading whitespace). Anything matching PATTERN is taken to be a delimiter separating the fields. (Note that the delimiter may be longer than one character.) If LIMIT is specified and positive, it represents the maximum number of fields the EXPR will be split into, though the actual number of fields returned depends on the number of times PATTERN matches within EXPR. If LIMIT is unspecified or zero, trailing null fields are stripped (which potential users of pop would do well to remember). If LIMIT is negative, it is treated as if an arbitrarily large LIMIT had been specified. Note that splitting an EXPR that evaluates to the empty string always returns the empty list, regardless of the LIMIT specified. A pattern matching the empty string (not to be confused with an empty pattern // , which is just one member of the set of patterns matching the epmty string), splits EXPR into individual characters. For example: print join ( ':' , split ( / */ , 'hi there' ) ) , "\n" ; produces the output 'h:i:t:h:e:r:e'. As a special case for split , the empty pattern // specifically matches the empty string; this is not be confused with the normal use of an empty pattern to mean the last successful match. So to split a string into individual characters, the following: print join ( ':' , split ( // , 'hi there' ) ) , "\n" ; produces the output 'h:i: :t:h:e:r:e'. Empty leading fields are produced when there are positive-width matches at the beginning of the string; a zero-width match at the beginning of the string does not produce an empty field. For example: print join ( ':' , split ( /(?=\w)/ , 'hi there!' ) ) ; produces the output 'h:i :t:h:e:r:e!'. Empty trailing fields, on the other hand, are produced when there is a match at the end of the string (and when LIMIT is given and is not 0), regardless of the length of the match. For example: print join ( ':' , split ( // , 'hi there!' , -1 ) ) , "\n" ; print join ( ':' , split ( /\W/ , 'hi there!' , -1 ) ) , "\n" ; produce the output 'h:i: :t:h:e:r:e:!:' and 'hi:there:', respectively, both with an empty trailing field. The LIMIT parameter can be used to split a line partially ( $login , $passwd , $remainder ) = split ( /:/ , $_ , 3 ) ; When assigning to a list, if LIMIT is omitted, or zero, Perl supplies a LIMIT one larger than the number of variables in the list, to avoid unnecessary work. For the list above LIMIT would have been 4 by default. In time critical applications it behooves you not to split into more fields than you really need. If the PATTERN contains parentheses, additional list elements are created from each matching substring in the delimiter. split ( /( )/ , "1-10,20" , 3 ) ; produces the list value ( 1 , '-' , 10 , ',' , 20 ) If you had the entire header of a normal Unix email message in $header, you could split it up into fields and their values this way: $header =~ s/\n(?=\s)//g ; # fix continuation lines %hdrs = ( UNIX_FROM = split /^(\S*?):\s*/m , $header ) ; The pattern /PATTERN/ may be replaced with an expression to specify patterns that vary at runtime. (To do runtime compilation only once, use /$variable/o .) As a special case, specifying a PATTERN of space ( ' ' ) will split on white space just as split with no arguments does. Thus, split ( ' ' ) can be used to emulate awk 's default behavior, whereas split ( / / ) will give you as many initial null fields (empty string) as there are leading spaces. A split on /\s+/ is like a split ( ' ' ) except that any leading whitespace produces a null first field. A split with no arguments really does a split ( ' ' , $_ ) internally. A PATTERN of /^/ is treated as if it were /^/m , since it isn't much use otherwise. Example: open ( PASSWD , '/etc/passwd' ) ; while ( PASSWD ) { chomp ; ( $login , $passwd , $uid , $gid , $gcos , $home , $shell ) = split ( /:/ ) ; #... } As with regular pattern matching, any capturing parentheses that are not matched in a split() will be set to undef when returned: @fields = split /(A)|B/ , "1A2B3" ; # @fields is (1, 'A', 2, undef, 3)
Answer: how do i find the index of a specific array value? contributed by merlyn my @array = qw( your array here ); my $search_for = "here"; my( $index )= grep { $array eq $search_for } 0..$#array; Answer: how do i find the index of a specific array value? contributed by hdp For very large arrays where bailing out as soon as a match is found is a win: my @a = ( 1 .. 1_000_000 ); # some large array my $want = 5843; my $index = 0; ++$index until $a == $want or $index $#a; Answer: how do i find the index of a specific array value? contributed by I0 You could use an index hash: my @array = qw( your array here ); my $search = "array"; my %index; @index{@array} = (0..$#array); my $index = $index{$search}; print $index, "\n"; This is a win, for larger arrays, if you need to do multiple/many lookups while the array remains static. Answer: How do I find the index of a specific array value? contributed by snoopy I like List::MoreUtils : use List::MoreUtils; my @array = qw( Apples Oranges Brains Toes Kiwi); my $search = "Toes"; my $index = List::MoreUtils::first_index {$_ eq $search} @array; print "index of $search = $index\n"; Answer: How do I find the index of a specific array value? contributed by marcussen Using merlyn's example with regular expressions, so you don't need to know the exact value of the element you are matching; my @array = ( 'Name: Mr. Jones', 'Phone: 555-555', 'Email: jones@example.com' ); my ( $index )= grep { $array =~ /Phone/ } 0..$#array; Replace ( $index ) with an array to match multiple instances. Answer: how do i find the index of a specific array value? contributed by MeowChow You could use my aindex and raindex . Answer: How do I find the index of a specific array value? contributed by simul If the array is large and sorted, you might want search it more efficiently: my @array = qw( your large, sorted array here ); my $search = "thing"; my $index = bsearch(\@array, $search); sub bsearch { my ($array, $word) = @_; my $low = 0; my $high = @$array - 1; while ( $low = $high ) { my $try = int( ($low+$high) / 2 ); $low = $try+1, next if $array- lt $word; $high = $try-1, next if $array- gt $word; return $try; } return; } Answer: How do I find the index of a specific array value? contributed by myuserid7 You should usefirst()from List::Util . It is a core module, unlike List::MoreUtils and other modules mentioned above, so it is portable and can be used with no extra effort. use List::Util qw(first); my @array = qw( Apples Oranges Brains Toes Kiwi); my $search = "Toes"; my $index = first { $array eq $search } 0 .. $#array; print "index of $search = $index\n";
use warnings; use strict; open FH,"test.txt"; open RESULT,"result.txt"; open LINE,"line.txt"; open LEN, "length.txt"; while (FH) { #chomp; print RESULT "line $. is $_"; print LINE "line $.\n"; my $leng=length$_; print LEN "The length of line $. is $leng \n"; } close FH; test.txt as follow 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3
Perl's a great language for special variables - variables that are set up without the programmer having to intervene and providing information ranging from the number of lines read from the current input file ($.) through the current process ID ($$) and the operating system ($^O). Other special variables effect how certain operations are performed ($| controlling output buffering / flushing, for example), or are fundamental in the operation of certain facilities - no more so than $_ and @_. ------------------------------------------- 1.@_含义 1)是perl中默认的数组变量 比如说你想移除数组中的一个元素赋值给一个变$value 方法1:你可以定义某个数组如@abcd my $value=shift @abcd; 方法2:你没有定义任何数组 my $value=shift @_; 和上例等效 这里perl会隐式的选择@_ 2)是sub子函数中的默认参数列表. 例如: sub funct($$) { ($param1, $param2) = @_; #Statement } 再例如,有下面一段代码: my $max_number = max(1,2); print "1 and 2 ,the max number is $max_number\n"; sub max{ my ($num1,$num2) = @_ ; ## 取出参数列表中的元素。 ........此处省略求max运算 } 在子函数中直接shift; 就可以从@_的前端弹出一个元素. shift; 等于 shift @_; ------------------------- @_ is the list of incoming parameters to a sub. So if you write a sub, you refer to the first parameter in it as $_ , the second parameter as $_ and so on. And you can refer to $#_ as the index number of the last parameter: sub demo { print "Called with ",$#_+1," params\n"; print "First param was $_ \n"; Note that the English module adds in the ability to refer to the special variables by other longer, but easier to remember, names such as @ARG for @_ and $PID for $$. But use English; can have a detrimental performance effect if you're matching regular expressions against long incoming strings 2.$_含义 1)$_为默认列表变量。在一个命令没有任何参数的时候,表示它从默认变量里读取。 例如: print; 等于 print $_; 2)默认模式匹配空间( pattern matching space ) s/.../.../; 等于 $_ =~ s/.../.../; --------------------------- Then any regular expression matches, chop s (and lc s and many more) without a parameter, and even print s assume you want to work on $_. Thus: while ($line = FH) { if ($line =~ /Perl/) { print FHO $line; } print uc $line; } Shortens to: while (FH) { /Perl/ and print FHO ; print uc; } 3.$1,$2,...等含义 以数字为名的变量保存的是上一次匹配操作(/pattern/)中,第n个小括号中的原符号所匹配内容。 $1就是第一对小括号中的原符号所对应的匹配内容。 $2就是第二对小括号中的原符号所对应的匹配内容。 内插功能: $str = "aaa4zzz7bbb"; $str =~ /(\d)z{3}(\d)/; print "$1\t$2\n"; 输出结果是:4 7
今天本来打算查R的循环语句,无意间碰到了一个函数,合并两个 数据框,能够替代perl里面的一段程序,用于按照ID列表提取数据。 x-data.frame(a=c(a,b,c,c)) x a 1 a 2 b 3 c 4 c y-data.frame(a=c(a,b,c), b=c(10,20,30)) y a b 1 a 10 2 b 20 3 c 30 merge(x,y) a b 1 a 10 2 b 20 3 c 30 4 c 30 ?merge Merge two data frames by common columns or row names, or do other versions of database join operations. 相应的perl程序: open(DATABASE,$ARGV ); open(PICK,$ARGV ); open(PICKDATA,$ARGV ); while (defined($in_line = DATABASE)) { chomp $in_line; @tokens = split /\t/,$in_line; $tokens_ID = $tokens ; $hash{$tokens_ID}=$in_line; } while (defined($in_line = PICK)) { chomp $in_line; @Seq_tokens = split /\t/,$in_line; $Seq_tokens_ID = $Seq_tokens ; print PICKDATA $hash{$Seq_tokens_ID}\n; } close DATABASE; close PICK; close PICKDATA;