There are many articles online about using weka for Chinese text classification. Here I only discuss one specific problem: suppose you already have a training set, have chosen a classifier such as NaiveBayes, and have trained it to obtain a classification model. How do you then use that model to classify an arbitrary Chinese text?

The key difficulty is this: when vectorizing a single text (weka's term for this is filtering), how do you guarantee that the attributes of the resulting arff file, and their order, match those of the classification model?

First, the training process should not have to be repeated. If it could be repeated, you could use weka's batch filtering facility: filter the training set first, then filter the text to be classified.

I considered using the filtered training arff as the input format, i.e.

filter.setInputFormat(traindataFiltered);
Instances testFiltered = Filter.useFilter(testRaw, filter);

but weka reports that this is not allowed. Whether this approach can be made to work remains to be studied; pointers from experts are welcome.

So I fell back on a brute-force approach: segment the text to be classified, save it into a directory structure identical to that of the training data, convert it to a raw arff with TextDirectoryLoader, and filter it with the same filter as the training set. Then, following the approach in http://weka.wikispaces.com/file/view/M5PExample.java, use the header (structure information) of the training instances to look up each attribute of the text to be classified and build a new instance. Finally, classify that instance with the trained model.

Here is the code:

/* the trained attributes need to be loaded first */
String trainsetfile = "dataFiltered.arff";
BufferedReader reader = new BufferedReader(new FileReader(trainsetfile));
ArffReader arff = new ArffReader(reader);
Instances trainheader = arff.getStructure();
trainheader.setClassIndex(0);

// deserialize the model if there already is one
Classifier classifier = (Classifier) weka.core.SerializationHelper.read(Modelpath);

/* Prepare the sample in two steps:
 * 1. segment  2. filter, in the same way the train dataset was prepared.
 * The sample text file is assumed not to be segmented yet and to be stored
 * in a folder of the sample directory that has the same directory structure
 * as the train data. */
segmenter segmenter = new segmenter(sampledirectory, "test_segmented");
segmenter.segment();

// convert the "test_segmented" directory into a dataset
TextDirectoryLoader loader = new TextDirectoryLoader();
loader.setDirectory(new File("test_segmented"));
Instances sampleRaw = loader.getDataSet();

// first filter it blindly
StringToWordVector filter = new StringToWordVector();
filter.setStemmer(new NullStemmer());
filter.setInputFormat(sampleRaw);
Instances testFiltered = Filter.useFilter(sampleRaw, filter);

// then transform the instances according to the train data structure (header);
// see also: http://weka.wikispaces.com/file/view/M5PExample.java
Instances header = trainheader;
Instances data = testFiltered;
for (int i = 0; i < data.numInstances(); i++) {
    Instance curr = data.instance(i);
    Instance inst = new Instance(header.numAttributes());
    inst.setDataset(header);
    for (int n = 0; n < header.numAttributes(); n++) {
        Attribute att = data.attribute(header.attribute(n).name());
        // the original attribute is also present in the current dataset
        if (att != null) {
            if (att.isNominal()) {
                if ((header.attribute(n).numValues() > 0) && (att.numValues() > 0)) {
                    String label = curr.stringValue(att);
                    int index = header.attribute(n).indexOfValue(label);
                    if (index != -1)
                        inst.setValue(n, index);
                }
            } else if (att.isNumeric()) {
                inst.setValue(n, curr.value(att));
            } else {
                throw new IllegalStateException("Unhandled attribute type!");
            }
        }
    }
    double prediction = classifier.classifyInstance(inst);
    String category = trainheader.classAttribute().value((int) prediction);
    System.out.println("the sample belongs to: " + category);
}
if (testFiltered.numInstances() == 0)
    System.out.println("there is no instance found in the arff file.");
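The header-matching loop above boils down to one idea: place each attribute value of the test text at the index the training header assigns to that attribute name, and silently drop words the training vocabulary has never seen (unset positions stay 0, as in weka). A minimal plain-Java sketch of that mapping, independent of weka (the class and method names here are hypothetical, chosen only for illustration):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderMap {
    // Map a test text's sparse word counts onto the training header's
    // attribute order. Words unknown to the training header are dropped;
    // training attributes missing from the test text default to 0.
    static double[] mapToHeader(List<String> trainAttributes,
                                Map<String, Double> testValues) {
        double[] inst = new double[trainAttributes.size()];
        for (int n = 0; n < trainAttributes.size(); n++) {
            Double v = testValues.get(trainAttributes.get(n));
            if (v != null)          // attribute also present in the test data
                inst[n] = v;
        }
        return inst;
    }

    public static void main(String[] args) {
        // hypothetical training header: class attribute first, then two words
        List<String> header = Arrays.asList("class", "apple", "banana");
        Map<String, Double> test = new HashMap<>();
        test.put("banana", 2.0);
        test.put("cherry", 5.0);    // unseen during training: dropped
        System.out.println(Arrays.toString(mapToHeader(header, test)));
        // prints [0.0, 0.0, 2.0]
    }
}
```

This also makes the limitation of the approach visible: any word of the new text that was not in the training vocabulary contributes nothing to the classification.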