According to http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code, when classifying with Weka there is usually no need to call the attribute-selection classes directly in your own code, because a meta-classifier and a filter are already available that perform attribute selection for you.
Weka provides a classifier with built-in attribute selection, called AttributeSelectedClassifier, together with a search class called GreedyStepwise. The classifier uses GreedyStepwise to greedily search through attribute subsets, evaluating each one in order to find the best solution.
Below is an example using CfsSubsetEval and GreedyStepwise. This meta-classifier performs a preprocessing step on the data before handing it to the base classifier:
Instances data = ... // from somewhere
AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);
// 10-fold cross-validation
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());
Notes:
(1) search.setSearchBackwards(true) enables backward search: the search starts from the full attribute set and removes attributes step by step.
(2) 10-fold cross-validation is a commonly used method for estimating an algorithm's accuracy: the dataset is split into ten parts; in turn, nine parts are used for training and the remaining one for testing, and the mean of the ten results is taken as the estimate of the algorithm's accuracy.
(3) On the GreedyStepwise object search, you can also call search.setThreshold(double threshold) to set the threshold by which the AttributeSelection module can discard attributes.
(4) Alternatively, use setNumToSelect(int n) to specify the number of attributes to select from the ranked list (if generating a ranking); -1 indicates that all attributes are to be retained.
(5) After calling search.search(eval, data), what exactly is the selected attribute subset? At this point the best subset has been found, but the other solutions are not visible. You can then call rankedAttributes(), which continues the search and returns all attributes in ranked order. From the Javadoc: "Produces a ranked list of attributes. Search must have been performed prior to calling this function. Search is called by this function to complete the traversal of the search space. A list of attributes and merits are returned. The attributes are ranked by the order they are added to the subset during a forward selection search. Individual merit values reflect the merit associated with adding the corresponding attribute to the subset; because of this, merit values may initially increase but then decrease as the best subset is 'passed by' on the way to the far side of the search space."
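To make the backward greedy search strategy described above concrete, here is a small self-contained sketch that does not use Weka at all: the merit function is a made-up stand-in for an evaluator such as CfsSubsetEval, and all class and method names are hypothetical. The search starts from the full feature set and repeatedly drops the single feature whose removal most improves the merit, stopping when no removal helps:

```java
import java.util.*;

/** Toy backward greedy search, in the spirit of GreedyStepwise with
 *  setSearchBackwards(true). Not Weka code; merit() is a made-up
 *  stand-in for a subset evaluator such as CfsSubsetEval. */
public class GreedyBackwardSketch {

    // Hypothetical merit: rewards subsets containing features 1 and 3,
    // and penalizes subset size (mimicking a relevance/redundancy trade-off).
    static double merit(Set<Integer> subset) {
        double score = 0;
        if (subset.contains(1)) score += 2.0;
        if (subset.contains(3)) score += 2.0;
        return score - 0.5 * subset.size();
    }

    static Set<Integer> searchBackwards(int numFeatures) {
        // start from the full feature set
        Set<Integer> current = new TreeSet<>();
        for (int i = 0; i < numFeatures; i++) current.add(i);
        double best = merit(current);
        boolean improved = true;
        while (improved) {
            improved = false;
            Integer toDrop = null;
            // try removing each feature in turn; keep the best single removal
            for (int f : current) {
                Set<Integer> candidate = new TreeSet<>(current);
                candidate.remove(f);
                double m = merit(candidate);
                if (m > best) { best = m; toDrop = f; }
            }
            if (toDrop != null) { current.remove(toDrop); improved = true; }
        }
        return current;
    }

    public static void main(String[] args) {
        // with 5 features (0..4), the search keeps the rewarded features
        System.out.println(searchBackwards(5)); // prints [1, 3]
    }
}
```

Each iteration evaluates only the subsets one removal away from the current one, which is why the greedy search is fast but can pass by the global optimum, exactly the caveat mentioned in the rankedAttributes() documentation above.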
I wrote this last night without actually testing it; I verified it today and it is very slow to run. The code is as follows:
// load the filtered training data from an ARFF file
ArffLoader loader = new ArffLoader();
loader.setFile(new File("dataFiltered.arff"));
Instances data = loader.getDataSet();
data.setClassIndex(0);

AttributeSelection attsel = new AttributeSelection();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);
attsel.SelectAttributes(data);

int[] attarray = attsel.selectedAttributes();
System.out.println("the selected attributes are as follows:");
for (int i = 0; i < attarray.length; i++) {
    // print the name of each selected attribute
    System.out.println(data.attribute(attarray[i]).name());
}
The example given by Weka:
import weka.attributeSelection.*;
import weka.classifiers.*;
import weka.classifiers.meta.*;
import weka.classifiers.trees.*;
import weka.core.*;
import weka.core.converters.ConverterUtils.*;
import weka.filters.*;

import java.util.*;

/**
 * performs attribute selection using CfsSubsetEval and GreedyStepwise
 * (backwards) and trains J48 with that. Needs 3.5.5 or higher to compile.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class AttributeSelectionTest {

  /**
   * uses the meta-classifier
   */
  protected static void useClassifier(Instances data) throws Exception {
    System.out.println("\n1. Meta-classifier");
    AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    J48 base = new J48();
    classifier.setClassifier(base);
    classifier.setEvaluator(eval);
    classifier.setSearch(search);
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));
    System.out.println(evaluation.toSummaryString());
  }

  /**
   * uses the filter
   */
  protected static void useFilter(Instances data) throws Exception {
    System.out.println("\n2. Filter");
    weka.filters.supervised.attribute.AttributeSelection filter =
      new weka.filters.supervised.attribute.AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    filter.setEvaluator(eval);
    filter.setSearch(search);
    filter.setInputFormat(data);
    Instances newData = Filter.useFilter(data, filter);
    System.out.println(newData);
  }

  /**
   * uses the low level approach
   */
  protected static void useLowLevel(Instances data) throws Exception {
    System.out.println("\n3. Low-level");
    AttributeSelection attsel = new AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    attsel.setEvaluator(eval);
    attsel.setSearch(search);
    attsel.SelectAttributes(data);
    int[] indices = attsel.selectedAttributes();
    System.out.println("selected attribute indices (starting with 0):\n"
      + Utils.arrayToString(indices));
  }

  /**
   * takes a dataset as first argument
   *
   * @param args the commandline arguments
   * @throws Exception if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // load data
    System.out.println("\n0. Loading data");
    DataSource source = new DataSource(args[0]);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1)
      data.setClassIndex(data.numAttributes() - 1);

    // 1. meta-classifier
    useClassifier(data);

    // 2. filter
    useFilter(data);

    // 3. low-level
    useLowLevel(data);
  }
}