ScienceNet - Tag - Chinese text classification


lixiangdong 2012-5-16 17:36

Until now my experiments used small training sets. With a larger training set, MyEclipse kept failing with a heap space error, and editing the .ini file did not help. What finally worked was setting the VM arguments (under Arguments in the MyEclipse run configuration) to -Xms500m -Xmx1024m; after that the error no longer appeared. The results below are for the NaiveBayes classifier. They are not very good (68.6% correctly classified) and need improvement. Also, the Accuracy By Class figures and the Confusion Matrix seem inconsistent with each other, by a wide margin. Why?

weka.filters.unsupervised.attribute.StringToWordVector in:9804
Number of instances: 9804
Number of attributes: 9302

=== Detailed Accuracy By Class ===

TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
0.794    0.015    0.788      0.794   0.791      0.966     C11-Space
0.5      0.002    0.444      0.5     0.471      0.975     C15-Energy
0.704    0.005    0.264      0.704   0.384      0.981     C16-Electronics
0.68     0.003    0.395      0.68    0.5        0.99      C17-Communication
0.858    0.007    0.953      0.858   0.903      0.982     C19-Computer
0.545    0.006    0.234      0.545   0.327      0.945     C23-Mine
0.86     0.015    0.251      0.86    0.389      0.98      C29-Transport
0.726    0.015    0.8        0.726   0.761      0.951     C3-Art
0.72     0.026    0.799      0.72    0.757      0.946     C31-Enviornment
0.614    0.014    0.838      0.614   0.709      0.945     C32-Agriculture
0.762    0.044    0.773      0.762   0.767      0.924     C34-Economy
0.647    0.003    0.508      0.647   0.569      0.983     C35-Law
0.843    0.019    0.189      0.843   0.309      0.985     C36-Medical
0.892    0.064    0.096      0.892   0.173      0.969     C37-Military
0.432    0.016    0.754      0.432   0.549      0.832     C38-Politics
0.591    0.016    0.84       0.591   0.694      0.911     C39-Sports
0.636    0.008    0.212      0.636   0.318      0.971     C4-Literature
0.644    0.017    0.189      0.644   0.292      0.931     C5-Education
0.568    0.002    0.532      0.568   0.549      0.95      C6-Philosophy
0.584    0.038    0.436      0.584   0.499      0.908     C7-History

Correctly Classified Instances        6730    68.6455 %
Incorrectly Classified Instances      3074    31.3545 %
Kappa statistic                       0.6529
Mean absolute error                   0.0313
Root mean squared error               0.1759
Relative absolute error               35.2848 %
Root relative squared error           83.4657 %
Total Number of Instances             9804

=== Confusion Matrix ===

    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o    p    q    r    s    t   <-- classified as
  508    1    3    5   31    2   12    0    2    1   11    0   16   30    0   17    0    0    1    0 | a = C11-Space
    1   16    0    1    0    3    3    0    1    0    0    0    2    4    0    0    0    1    0    0 | b = C15-Energy
    2    0   19    1    0    1    2    0    0    0    0    0    1    0    0    1    0    0    0    0 | c = C16-Electronics
    0    0    3   17    0    0    2    0    0    0    0    0    1    1    0    0    0    1    0    0 | d = C17-Communication
   59    0   33    9 1164    0    7    3    4    0   10    3   16   33    0    8    3    3    0    2 | e = C19-Computer
    0    1    0    0    0   18    6    0    1    2    0    0    1    3    0    0    0    1    0    0 | f = C23-Mine
    1    1    0    0    0    1   49    0    0    1    0    0    1    0    1    1    0    1    0    0 | g = C29-Transport
    0    0    0    0    0    1    1  537    0    0    3    0   12   27   12   18   44   16    1   68 | h = C3-Art
   43   14    5    0   17    9   21    0  876   34   63    7   60   20    0   30    2   16    0    0 | i = C31-Enviornment
    2    0    2    2    0   14   25    0  165  627  112    4   22    4    1    9    1   20    0   11 | j = C32-Agriculture
    1    0    4    6    2   25   41    3    3   78 1219   12   13   30   74   29    0   15    1   44 | k = C34-Economy
    0    0    0    0    0    0    0    0    0    0    0   33    0   14    1    0    0    3    0    0 | l = C35-Law
    0    0    0    0    0    0    0    0    0    0    0    0   43    4    0    1    0    3    0    0 | m = C36-Medical
    0    1    0    0    0    0    0    0    0    0    0    0    0   66    0    1    0    5    0    1 | n = C37-Military
    0    1    1    0    0    1    1    4    0    0   74    5    1  299  442    7    4   16   14  154 | o = C38-Politics
   28    1    1    2    8    1   16   35   45    2   64    1   26  138   28  740    4   41    2   70 | p = C39-Sports
    0    0    1    0    0    0    1    2    0    0    0    0    0    2    0    1   21    3    0    2 | q = C4-Literature
    0    0    0    0    0    0    1    0    0    0    1    0    3    5    2    8    1   38    0    0 | r = C5-Education
    0    0    0    0    0    0    0    1    0    0    1    0    0    0    4    1    1   11   25    0 | s = C6-Philosophy
    0    0    0    0    0    1    7   86    0    3   19    0    9   11   21    9   18    7    3  272 | t = C7-History

[Screenshot of the output, attached because the pasted rows and columns did not line up.]
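One way to test whether the two reports really disagree is to recompute the per-class figures from the confusion matrix. In Weka's layout each row is an actual class and each column a predicted class, so recall (TP Rate) is the diagonal cell divided by its row sum, and precision is the diagonal cell divided by its column sum. For C11-Space the row sums to 640 and the column to 645, giving 508/640 ≈ 0.794 and 508/645 ≈ 0.788, which agrees with the table above. The sketch below is only an illustration of that arithmetic; the class name ConfusionMetrics and the small toy matrix are assumptions, not part of the original post.

```java
import java.util.Locale;

// Hypothetical helper: recompute per-class recall and precision from a confusion matrix
// whose rows are actual classes and whose columns are predicted classes.
final class ConfusionMetrics {

    /** Recall (Weka's "TP Rate") of class k: diagonal cell divided by the sum of row k. */
    static double recall(int[][] m, int k) {
        long rowSum = 0;
        for (int j = 0; j < m[k].length; j++) {
            rowSum += m[k][j];
        }
        return rowSum == 0 ? 0.0 : (double) m[k][k] / rowSum;
    }

    /** Precision of class k: diagonal cell divided by the sum of column k. */
    static double precision(int[][] m, int k) {
        long colSum = 0;
        for (int[] row : m) {
            colSum += row[k];
        }
        return colSum == 0 ? 0.0 : (double) m[k][k] / colSum;
    }

    public static void main(String[] args) {
        // Toy 3-class matrix used only to exercise the two functions.
        int[][] toy = { {8, 2, 0}, {1, 6, 3}, {1, 0, 9} };
        for (int k = 0; k < toy.length; k++) {
            System.out.printf(Locale.ROOT, "class %d: recall=%.3f precision=%.3f%n",
                    k, recall(toy, k), precision(toy, k));
        }
    }
}
```

Low precision next to high recall, as for C37-Military, then simply means that many instances of other classes land in that column.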

Category: weka | 3114 views | 0 comments

Until you try feature selection, you don't realize how much stop-word removal matters

lixiangdong 2012-5-10 21:01

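I used to think it was fine not to remove stop words. Over the past two days I tried feature selection and finally realized how much stop-word removal matters. A Chinese text training set of 984 instances (containing digits, English-letter words and so on) produced more than 3,500 features, and a single GreedyStepwise feature selection run had still not finished after 14 hours. I decided that stop words had to go, and that I would use a large stop-word list. A token is discarded when the following condition holds:

    (stopword.is(word) || word.length() < 2 || (word.charAt(0) >= '0' && word.charAt(0) <= '9') || (word.charAt(0) >= 'a' && word.charAt(0) <= 'z'))

Here stopword is an instance of Weka's Stopwords class, and the stop-word list is the HIT (Harbin Institute of Technology) list with single characters removed. This did remove a lot of words. Even so, analysing 1,084 Chinese text instances still yields 4,880 features.

Running feature selection (as a filter) on the segmented training set, I found that most of the time goes into calls to evaluateSubset handling non-numeric fields, in Weka's CfsSubsetEval.java. Processing is fairly fast at first but gets slower and slower; past 1,000 features it crawls at about one per second. I first suspected that IK's segmentation was poor, so I switched to JE, and the result was exactly the same, down to the last number! What now?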
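The token test described above can be tried without Weka. The sketch below is a stand-in under stated assumptions: a plain java.util.Set replaces Weka's Stopwords class, the class name TokenFilter and the sample tokens are hypothetical, and the Chinese strings are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the post's filter condition; a Set plays the role of the stop-word list.
final class TokenFilter {

    /** Returns true when the token should be kept as a feature candidate. */
    static boolean keep(String word, Set<String> stopwords) {
        if (word == null || word.length() < 2) {
            return false;                       // drop single characters
        }
        char c = word.charAt(0);
        if (c >= '0' && c <= '9') {
            return false;                       // drop tokens starting with a digit
        }
        if (c >= 'a' && c <= 'z') {
            return false;                       // drop tokens starting with a lowercase ASCII letter
        }
        return !stopwords.contains(word);       // drop stop words
    }

    /** Applies keep() to every token, preserving order. */
    static List<String> filter(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (keep(t, stopwords)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "\u6211\u4eec" = "we", "\u7684" = a one-character particle, "\u5206\u7c7b" = "classification"
        Set<String> stop = new HashSet<>(Arrays.asList("\u6211\u4eec"));
        List<String> tokens = Arrays.asList("\u6211\u4eec", "\u7684", "2012", "weka", "\u5206\u7c7b");
        System.out.println(filter(tokens, stop));
    }
}
```

Like the condition in the post, this only tests lowercase ASCII letters, so a token such as "Weka" would still pass.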

Category: weka | 5123 views | 0 comments

Chinese text classification with Weka: classifying an arbitrary single text

lixiangdong 2012-5-8 14:28

Plenty has already been written online about Chinese text classification with Weka. This post addresses one specific question: given a training set, a chosen classifier such as NaiveBayes, and a model trained from them, how do you use that model to classify an arbitrary Chinese text? The key problem is that when a single text is vectorized (in Weka terms, filtered), the attributes of the resulting arff file, and their order, must match those of the classification model.

First, the training process should not have to be repeated. If repeating it were acceptable, you could use Weka's batch filtering: filter the training set first, then filter the text to be classified. I considered using the filtered training arff as the input format, that is:

    filter.setInputFormat(traindataFiltered);
    Instances testFiltered = Filter.useFilter(testRaw, filter);

but Weka reports that this is not allowed. Whether this approach can be made to work still needs study, and I would welcome pointers from experts.

So I took a clumsy route: segment the text to be classified, save it into a directory structure identical to that of the training data, convert it to a raw arff with TextDirectoryLoader, and filter it with the same kind of filter as the training set. Then, following the method in http://weka.wikispaces.com/file/view/M5PExample.java , use the header (the structure information) of the training instances to look up each attribute in the text to be classified and build a new instance. Finally, classify that instance with the trained model. Here is the code:

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileReader;

    import weka.classifiers.Classifier;
    import weka.core.Attribute;
    import weka.core.Instance;
    import weka.core.Instances;
    import weka.core.converters.ArffLoader.ArffReader;
    import weka.core.converters.TextDirectoryLoader;
    import weka.core.stemmers.NullStemmer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    /* trained attributes need to be loaded first */
    String trainsetfile = "dataFiltered.arff";
    BufferedReader reader = new BufferedReader(new FileReader(trainsetfile));
    ArffReader arff = new ArffReader(reader);
    Instances trainheader = arff.getStructure();
    trainheader.setClassIndex(0);

    // deserialize the model if one already exists
    Classifier classifier = (Classifier) weka.core.SerializationHelper.read(Modelpath);

    /* Prepare the sample in two steps:
     * 1. segment  2. filter in the same way as the training dataset.
     * The sample text file is assumed to be unsegmented and stored under
     * the sample directory, which has the same directory structure as the training data.
     */
    Segmenter segmenter = new Segmenter(sampledirectory, "test_segmented");
    segmenter.segment();

    // convert the "test_segmented" directory into a dataset
    TextDirectoryLoader loader = new TextDirectoryLoader();
    loader.setDirectory(new File("test_segmented"));
    Instances sampleRaw = loader.getDataSet();

    // first filter it blindly
    StringToWordVector filter = new StringToWordVector();
    filter.setStemmer(new NullStemmer());
    filter.setInputFormat(sampleRaw);
    Instances testFiltered = Filter.useFilter(sampleRaw, filter);

    // then transform the instances according to the training data structure (header)
    // see also: http://weka.wikispaces.com/file/view/M5PExample.java
    Instances header = trainheader;
    Instances data = testFiltered;
    for (int i = 0; i < data.numInstances(); i++) {
        Instance curr = data.instance(i);
        // all values of the new instance start out as missing
        Instance inst = new Instance(header.numAttributes());
        inst.setDataset(header);
        for (int n = 0; n < header.numAttributes(); n++) {
            Attribute att = data.attribute(header.attribute(n).name());
            // the training attribute is also present in the current dataset
            if (att != null) {
                if (att.isNominal()) {
                    if ((header.attribute(n).numValues() > 0) && (att.numValues() > 0)) {
                        String label = curr.stringValue(att);
                        int index = header.attribute(n).indexOfValue(label);
                        if (index != -1)
                            inst.setValue(n, index);
                    }
                } else if (att.isNumeric()) {
                    inst.setValue(n, curr.value(att));
                } else {
                    throw new IllegalStateException("Unhandled attribute type!");
                }
            }
        }
        double prediction = classifier.classifyInstance(inst);
        String category = trainheader.classAttribute().value((int) prediction);
        System.out.println("the sample belongs to: " + category);
    }
    if (testFiltered.numInstances() == 0)
        System.out.println("there is no instance found in the arff file.");
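The header-matching idea in the code above can be seen in isolation with plain collections. This is a minimal sketch, not the Weka API: the class name HeaderAligner, the list of training attribute names, and the word-count map are all hypothetical. It also makes one choice that differs from the code above: a training attribute absent from the sample is set to 0 here, which is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration: project a sample's word values onto the training attribute order.
final class HeaderAligner {

    /**
     * Builds a vector whose positions follow the training attributes.
     * Words unknown to the training header are ignored; training attributes
     * that do not occur in the sample are set to 0 (an assumption of this sketch).
     */
    static double[] align(List<String> trainAttributes, Map<String, Double> sample) {
        double[] vector = new double[trainAttributes.size()];
        for (int n = 0; n < vector.length; n++) {
            Double value = sample.get(trainAttributes.get(n));
            vector[n] = (value == null) ? 0.0 : value;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> header = Arrays.asList("economy", "market", "sports");
        Map<String, Double> sample = new HashMap<>();
        sample.put("market", 1.0);
        sample.put("unseen-word", 1.0);   // not in the training header, so it is dropped
        System.out.println(Arrays.toString(align(header, sample)));
    }
}
```

Whatever the sample contains, the result always has the same length and attribute order as the training header, which is the property the classifier relies on.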

Category: weka | 8902 views | 0 comments

A Weka Chinese text classification example

lixiangdong 2012-5-7 15:31

Written following Qu Wei's blog. Original: http://quweiprotoss.blog.163.com/blog/static/40882883201103051150347/

Step zero: get the tools you need, weka.jar, lucene-core.jar and IKAnalyzer.jar, and add them to your project. Use whichever segmentation package you like; it does not have to be IKAnalyzer.jar.

Step one: you need a Chinese dataset. If you already have a task, enough said. If not, pick a widely recognized one. I used to use Sogou's text classification dataset, then found that Sogou's data does not seem to be well accepted either. From what I read online, the People's Daily corpus built by Peking University and the Modern Chinese corpus built by Tsinghua University seem somewhat more formal, but I find the People's Daily dataset really not that good, and it is the People's Daily after all, so avoid it if you can. I searched briefly for the Modern Chinese corpus and did not find it. Mr. Tan Songbo's dataset requires a signed statement, which I could not be bothered to write. The most convenient option is a dataset from Fudan: http://www.nlp.org.cn/docs/20030623/25/tc-corpus-train.rar . What I dislike about it is that its documents do not come from a single source.

Step two: the dataset has to be arranged in a structure Weka can process. This is easy: just unpack the dataset, because the required format is one folder per class (you can refer to my weka …). There is one more problem: your machine often does not have enough memory to process the whole dataset, so you can pick a few classes and put a few dozen documents in each.

Step three: segmentation. In wvtool you can subclass its tokenizer class and use your own logic, and Weka allows this too. But the most convenient way is to segment the files directly. My approach is simple: segment every file under the source folder and save the results into another folder. Here is my implementation (heh, don't laugh, I don't really know Java):

    package preprocess;

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Random;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.TextDirectoryLoader;
    import weka.core.stemmers.NullStemmer;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class Segmenter {
        private String sourceDir;
        private String targetDir;

        Segmenter(String source, String target) {
            sourceDir = source;
            targetDir = target;
        }

        public void segment() {
            segmentDir(sourceDir, targetDir);
        }

        // walk the source tree and mirror it, segmented, under the target directory
        public void segmentDir(String source, String target) {
            File[] file = (new File(source)).listFiles();
            for (int i = 0; i < file.length; i++) {
                if (file[i].isFile()) {
                    segmentFile(file[i].getAbsolutePath(), target + File.separator + file[i].getName());
                }
                if (file[i].isDirectory()) {
                    String _sourceDir = source + File.separator + file[i].getName();
                    String _targetDir = target + File.separator + file[i].getName();
                    (new File(_targetDir)).mkdirs();
                    segmentDir(_sourceDir, _targetDir);
                }
            }
        }

        // segment one file and write the tokens separated by spaces
        public void segmentFile(String sourceFile, String targetFile) {
            try {
                FileReader fr = new FileReader(sourceFile);
                BufferedReader br = new BufferedReader(fr);
                FileWriter fw = new FileWriter(targetFile);
                BufferedWriter bw = new BufferedWriter(fw);

                Analyzer analyzer = new IKAnalyzer();
                TokenStream tokenStream = analyzer.tokenStream("", br);
                TermAttribute termAtt = (TermAttribute) tokenStream.getAttribute(TermAttribute.class);
                while (tokenStream.incrementToken()) {
                    bw.write(termAtt.term());
                    bw.write(' ');
                }
                bw.close();
                fw.close();
                br.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        public static void main(String[] args) throws Exception {
            String filename = "D:\\workspace\\text_mining\\test";

            // convert the directory into a dataset
            TextDirectoryLoader loader = new TextDirectoryLoader();
            loader.setDirectory(new File(filename));
            Instances dataRaw = loader.getDataSet();
            //System.out.println("\n\nImported data:\n\n" + dataRaw);
            {
                FileWriter fw = new FileWriter("dataRaw.arff");
                BufferedWriter bw = new BufferedWriter(fw);
                bw.write(dataRaw.toString());
                bw.close();
                fw.close();
            }

            StringToWordVector filter = new StringToWordVector();
            filter.setStemmer(new NullStemmer());
            filter.setInputFormat(dataRaw);
            Instances dataFiltered = Filter.useFilter(dataRaw, filter);
            {
                FileWriter fw = new FileWriter("dataFiltered.arff");
                BufferedWriter bw = new BufferedWriter(fw);
                bw.write(dataFiltered.toString());
                bw.close();
                fw.close();
            }
            //System.out.println("\n\nFiltered data:\n\n" + dataFiltered);

            dataFiltered.setClassIndex(0);

            // train NaiveBayes and output model
            NaiveBayes classifier = new NaiveBayes();
            /*classifier.buildClassifier(dataFiltered);
            System.out.println("\n\nClassifier model:\n\n" + classifier);*/

            Evaluation eval = new Evaluation(dataFiltered);
            eval.crossValidateModel(classifier, dataFiltered, 3, new Random(1));
            System.out.println(eval.toClassDetailsString());
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }

The only reminder is that the TextDirectoryLoader class must of course be the one modified in step four, otherwise the output is all garbled characters. You can print dataRaw and dataFiltered to take a look.

Finally, a word about the generated arff file: the class is in the first column, so don't get that wrong. Some Weka users may not have seen the compressed (sparse) arff format before. Here is an example: {1 1,6 1,7 1,12 1,13 1} means that field 2 has value 1, field 7 has value 1, field 8 has value 1, field 13 has value 1, and field 14 has value 1. If you open the final arff file in a text editor you may be confused: why is the first class missing? The reason is that the nominal value of the first class is 0, so it is not written out. Don't panic, heh.

-------------------------------
The following was added by Li Xiangdong:

Note: the last five statements of the code above use Weka's evaluation routines to assess how well NaiveBayes classifies. If you only want to train NaiveBayes on the training set and save the resulting model, the code can be written like this:

    // train NaiveBayes and output model
    NaiveBayes classifier = new NaiveBayes();
    classifier.buildClassifier(dataFiltered);
    // serialize model
    SerializationHelper.write("\\some\\where\\naivebayes.model", classifier);

Then, when you have an instance to classify with the trained model, you can write:

    // deserialize model
    Classifier classifier = (Classifier) weka.core.SerializationHelper.read("\\some\\where\\naivebayes.model");
    // prepare the sample
    String samplefile = "\\some\\where\\test\\singlesample.arff";
    BufferedReader reader = new BufferedReader(new FileReader(samplefile));
    ArffReader arff = new ArffReader(reader);
    Instances sampledata = arff.getData();
    sampledata.setClassIndex(0);
    // classify
    double prediction = classifier.classifyInstance(sampledata.instance(0));
    String category = sampledata.classAttribute().value((int) prediction);
    System.out.println("the sample belongs to: " + category);

Note, however, that the statements above return the correct string only if the class enumeration (the list of nominal class values) in sampledata is correct.
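The sparse arff row format described above, such as {1 1,6 1,7 1,12 1,13 1}, is easy to decode by hand: each pair is a zero-based attribute index followed by a value, and any index that is not listed has value 0. The sketch below is a simplified reader for such a row, assuming unquoted values without embedded commas; the class name SparseArffRow is hypothetical and this is not Weka's own parser.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, simplified decoder for one sparse ARFF data row (no quoted values or embedded commas).
final class SparseArffRow {

    /** Maps each listed zero-based attribute index to its value; unlisted indices are implicitly 0. */
    static Map<Integer, String> parse(String row) {
        String body = row.trim();
        if (!body.startsWith("{") || !body.endsWith("}")) {
            throw new IllegalArgumentException("not a sparse row: " + row);
        }
        body = body.substring(1, body.length() - 1).trim();
        Map<Integer, String> values = new TreeMap<>();
        if (body.isEmpty()) {
            return values;                          // "{}" means every attribute is 0
        }
        for (String pair : body.split(",")) {
            String[] kv = pair.trim().split("\\s+", 2);
            if (kv.length < 2) {
                throw new IllegalArgumentException("bad index/value pair: " + pair);
            }
            values.put(Integer.parseInt(kv[0]), kv[1]);
        }
        return values;
    }

    public static void main(String[] args) {
        Map<Integer, String> row = parse("{1 1,6 1,7 1,12 1,13 1}");
        // Index 0 (the class column in this post's files) is absent, so its value is 0,
        // i.e. the first nominal class label.
        System.out.println("listed indices: " + row.keySet());
        System.out.println("class column listed? " + row.containsKey(0));
    }
}
```

This is why rows of the first class show no class entry: value 0 is simply never written in the sparse format.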

Category: weka | 9651 views | 2 comments
