
Perl Code Example: word_freq, a Word Frequency Counter

2012-4-26 13:55 | Category: Perl code | System category: Research notes | Keywords: statistics, Perl, software, source code

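While reading the book 《Shell 脚本学习指南》 [1] today, I was inspired to write a word frequency counter, so I spent a few hours writing a program of a little over 100 lines in Perl. word_freq is released here as free software [2] with open source code. The source code is attached at the end of this post.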

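I. Runtime environment

1 perl

The program is written in Perl, so a perl interpreter [3] must be installed before it can run. It can be downloaded from http://www.perl.org/get.html#win32

2 Command line

The program must be run from a command-line interface (CLI) [4], because it reads a text stream from standard input (STDIN) and prints its results to standard output (STDOUT). This makes I/O redirection and pipelines easy to use.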

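II. Input and output

1 Input

Input is plain text; Chinese text is not supported. The program reads from standard input, so data can be supplied with the I/O redirection operator '<', through a pipe, or typed in directly. For example: cat file | word_freq, or word_freq < file, or run word_freq, type some English words, and press Ctrl-D to finish. As the SYNOPSIS in the help text shows, file names can also be given as arguments.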
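The word-splitting rule from the source code at the end of this post can be tried in isolation. The sketch below wraps it in a helper called tokenize, a name introduced here for illustration only (word_freq applies the regular expression inline). It shows that only runs of ASCII letters count as words, which is why digits, punctuation, and Chinese text are skipped.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# tokenize() is a hypothetical helper for illustration; word_freq itself
# applies the same regular expression directly inside its read loop.
sub tokenize {
    my ($line) = @_;
    # In list context, a /g match returns every match:
    # here, each run of ASCII letters.
    return $line =~ /[a-zA-Z]+/g;
}

my @words = tokenize("The genome of B. subtilis, sequenced in 1997.");
print join(" ", @words), "\n";
# prints: The genome of B subtilis sequenced in
```

Note that an apostrophe also splits a word, so "don't" is counted as "don" and "t".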

2 Output

The output looks like this:

Rank Word Freq. Sum
1 the 394 394
2 of 322 716
3 and 156 872
4 to 146 1018
5 in 123 1141
6 genome 98 1239
7 B 95 1334
8 a 84 1418
9 for 72 1490
10 were 69 1559

Total words in text: 7063

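The first column is the rank, the second is the word, the third is the number of occurrences, and the fourth is the running total of the counts printed so far. The last line gives the total number of words in the text.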
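How the four columns relate can be sketched in a few lines of Perl. The helper rank_table below is hypothetical and only illustrates the idea; the alphabetical tie-break is an addition made here so that the output is repeatable, and is not part of word_freq.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# rank_table() is a hypothetical helper: it takes a hash reference of
# word counts and returns one row per word: [rank, word, freq, running sum].
sub rank_table {
    my ($counts) = @_;
    my ($rank, $sum) = (0, 0);
    my @rows;
    # Highest count first; ties are broken alphabetically (an assumption
    # made for this sketch only).
    foreach my $w (sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b }
                   keys %$counts) {
        $sum += $counts->{$w};
        push @rows, [++$rank, $w, $counts->{$w}, $sum];
    }
    return @rows;
}

# Counts taken from the first three rows of the sample output above
my %counts = (the => 394, of => 322, and => 156);
print join("\t", qw(Rank Word Freq. Sum)), "\n";
print join("\t", @$_), "\n" for rank_table(\%counts);
```

Run on these three counts, the rows are 1 the 394 394, 2 of 322 716, and 3 and 156 872, matching the sample output.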

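3 Options

-h Print the help page

-c Count characters instead of words

-m NUM Print only words that occur at least NUM times

-M NUM Print only words that occur at most NUM times

-w NUM Print only words at least NUM characters long

-W NUM Print only words at most NUM characters long

-i Ignore case

The options above can be combined.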
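Combining options means a word is printed only if it passes every limit that was set. A minimal sketch of that logic follows; keep_word and its option names are hypothetical and were chosen for this example, while the checks mirror the -m, -M, -w, and -W tests in the source code.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# keep_word() is a hypothetical helper mirroring the -m/-M/-w/-W checks.
# An option that is unset (undef or 0) places no limit.
sub keep_word {
    my ($word, $freq, %opt) = @_;
    return 0 if $opt{min_freq}   && $freq < $opt{min_freq};
    return 0 if $opt{max_freq}   && $freq > $opt{max_freq};
    return 0 if $opt{min_length} && length($word) < $opt{min_length};
    return 0 if $opt{max_length} && length($word) > $opt{max_length};
    return 1;
}

# Made-up counts for illustration
my %counts = (the => 394, genome => 98, sequencing => 12, a => 84);

# Similar in spirit to running: word_freq -m 50 -w 3
foreach my $w (sort { $counts{$b} <=> $counts{$a} } keys %counts) {
    next unless keep_word($w, $counts{$w}, min_freq => 50, min_length => 3);
    print "$w\t$counts{$w}\n";
}
```

With these made-up counts, only "the" and "genome" are printed: "sequencing" occurs fewer than 50 times and "a" is shorter than 3 letters.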

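III. Uses

1 Text analysis. The program can be used to analyze the word frequencies of an article.

2 Reading aid for English papers. I tested it on one English paper; ignoring case, it found 1577 distinct words. It seems that knowing no more than 2000 words is enough to read a scientific paper.

3 Computing the GC content of a DNA sequence (counting characters with -c).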
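For the GC content, the character counts for G and C are divided by the counts for A, C, G, and T together. The sketch below does the same calculation directly in Perl; gc_content is a hypothetical helper, and it assumes the input is a bare sequence with no FASTA header line. It needs Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# gc_content() is a hypothetical helper for illustration.
# Assumption: $seq holds only sequence data (no FASTA header).
sub gc_content {
    my ($seq) = @_;
    my %count;
    # Count every character, ignoring case (like word_freq -c -i)
    $count{lc $_}++ for $seq =~ /./g;
    my $gc   = ($count{g} // 0) + ($count{c} // 0);
    my $acgt = $gc + ($count{a} // 0) + ($count{t} // 0);
    # Avoid dividing by zero on empty input
    return $acgt ? $gc / $acgt : 0;
}

printf "GC content: %.1f%%\n", 100 * gc_content("ATGCGCTA");
# prints: GC content: 50.0%
```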

References:

[1] http://book.douban.com/subject/3519360/

[2] http://www.gnu.org/gnu/the-gnu-project.html

[3] http://zh.wikipedia.org/wiki/Perl

[4] http://zh.wikipedia.org/wiki/命令行界面

Source code:

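#!/usr/bin/perl
use strict;
use warnings;

# Options and counters shared by the main program and the subroutines below
my (@input_files, $help, $character, $ignore_case);
my ($min_freq, $max_freq, $min_length, $max_length);
my (%word, $count, $sum);
my $total = 0;

&parse_commands();
if($help){&help();}

# Parse input text
unless(@input_files){
    # No file given: read from standard input
    while(<STDIN>){
        my @txt;
        if($character){@txt = $_ =~ /./g;}
        else{@txt = $_ =~ /[a-zA-Z]+/g;}
        foreach(@txt){
            if($ignore_case){$_ = lc $_;}
            $word{$_}++;
        }
        $total += @txt;
    }
}else{
    foreach my $file (@input_files){
        open my $fh, '<', $file or die "Can't open file $file: $!\n";
        while(<$fh>){
            my @txt;
            if($character){@txt = $_ =~ /./g;}
            else{@txt = $_ =~ /[a-zA-Z]+/g;}
            foreach(@txt){
                if($ignore_case){$_ = lc $_;}
                $word{$_}++;
            }
            $total += @txt;
        }
        close $fh;
    }
}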

# Print title
print "Rank\t";
if($character){
    print "Char.\t";
}else{
    print "Word\t";
}
print "Freq.\t";
print "Sum\n";

# Print frequency, most frequent first
foreach(sort{$word{$b} <=> $word{$a}}(keys %word)){
    # Skip entries outside the requested frequency and length limits
    if($min_freq && $word{$_} < $min_freq){next;}
    if($max_freq && $word{$_} > $max_freq){next;}
    if($min_length && length($_) < $min_length){next;}
    if($max_length && length($_) > $max_length){next;}
    $count++;
    $sum += $word{$_};
    print "$count\t";
    print "$_\t";
    print "$word{$_}\t";
    print "$sum\n";
}
print "Total ",($character?"characters":"words")," in text: $total\n";

# Subroutines
sub parse_commands{
    while(@ARGV){
        my $arg = shift @ARGV;
        # An argument that names an existing file is treated as input
        if(-e $arg){push @input_files,$arg;}
        elsif($arg eq '-h'){$help = 1;}
        elsif($arg eq '-c'){$character = 1;}
        elsif($arg eq '-m'){$min_freq = shift @ARGV;}
        elsif($arg eq '-M'){$max_freq = shift @ARGV;}
        elsif($arg eq '-w'){$min_length = shift @ARGV;}
        elsif($arg eq '-W'){$max_length = shift @ARGV;}
        elsif($arg eq '-i'){$ignore_case = 1;}
        else{
            print STDERR "Unrecognized flag: $arg\n";
            print STDERR "$0 -h for help\n";
            exit 1;
        }
    }
}

sub help{
    system("clear");
    # Single-quoted here-document: the text is printed literally, so the "@"
    # in the e-mail address is not interpolated as an array
    print <<'END_HELP';
WORD_FREQ(1)             Word Frequency Analysis             WORD_FREQ(1)

NAME
       word_freq - word frequency analysis

SYNOPSIS
       word_freq [OPTION]... [FILE]...

DESCRIPTION
       Count words of text from FILE(s), or standard input, and print
       the frequency of each word or character.

OPTIONS
       -c      Print frequency of characters
       -m NUM  Print words with minimum frequency NUM
       -M NUM  Print words with maximum frequency NUM
       -w NUM  Print words with minimum length NUM
       -W NUM  Print words with maximum length NUM
       -i      Ignore case
       -h      Display this help and exit

       With no FILE, read standard input.

AUTHOR
       Written by Leiting Li <lileiting@foxmail.com>

       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This is free software: you are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

LEITING LI                    February 2012                  WORD_FREQ(1)
END_HELP
    exit;
}

(Leiting Li, Feb 26, 2012)

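To repost this article, please contact the original author for permission, and state that it comes from Leiting Li's (李雷廷) ScienceNet blog.
Link: https://m.sciencenet.cn/blog-656335-563860.html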
