lixiangdong's personal blog: http://blog.sciencenet.cn/u/lixiangdong

Blog post

Attribute selection in Weka

Viewed 22665 times | 2012-5-8 21:47 | Category: weka | Section: research notes | Keywords: optimal solution, feature selection, text classification, greedy algorithm

According to http://weka.wiki.sourceforge.net/Use+Weka+in+your+Java+code, when classifying with Weka there is usually no need to call the attribute-selection classes directly in your code, because a meta-classifier and a filter that perform attribute selection are already provided.

Weka provides AttributeSelectedClassifier, a classifier with built-in attribute selection, and a search class called GreedyStepwise. The classifier uses GreedyStepwise to greedily search through attribute subsets, evaluating each candidate subset to find the best one.
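To make the idea concrete, here is a minimal plain-Java sketch of backward greedy stepwise search, independent of Weka. The merit function is a made-up stand-in for CfsSubsetEval (which scores subsets by attribute-class correlation versus inter-attribute redundancy); everything here is illustrative, not Weka's implementation:

```java
import java.util.*;

public class GreedyBackwardDemo {

    // Made-up merit function standing in for CfsSubsetEval: it rewards
    // attributes 0 and 2 and penalizes subset size (redundancy).
    static double merit(Set<Integer> subset) {
        double m = 0.0;
        if (subset.contains(0)) m += 0.5;
        if (subset.contains(2)) m += 0.4;
        return m - 0.1 * subset.size();
    }

    // Backward greedy stepwise: start from the full attribute set and keep
    // removing the attribute whose removal improves merit most, until no
    // single removal helps any more (a local optimum).
    static Set<Integer> select(int numAttrs) {
        Set<Integer> current = new TreeSet<>();
        for (int i = 0; i < numAttrs; i++) current.add(i);
        boolean improved = true;
        while (improved) {
            improved = false;
            int toDrop = -1;
            double best = merit(current);
            for (int a : new ArrayList<>(current)) {   // try removing each attribute
                current.remove(a);
                if (merit(current) > best) { best = merit(current); toDrop = a; }
                current.add(a);
            }
            if (toDrop >= 0) { current.remove(toDrop); improved = true; }
        }
        return current;
    }

    public static void main(String[] args) {
        System.out.println(select(4));   // prints the locally best subset
    }
}
```

With setSearchBackwards(true), Weka's GreedyStepwise proceeds in the same spirit: a hill climb over subsets that stops at the first subset no neighbor improves on.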

Below is an example that uses CfsSubsetEval together with GreedyStepwise. The meta-classifier performs attribute selection as a preprocessing step before passing the data on to the base classifier:

Instances data = ...  // from somewhere

AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
J48 base = new J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);

// 10-fold cross-validation
Evaluation evaluation = new Evaluation(data);
evaluation.crossValidateModel(classifier, data, 10, new Random(1));
System.out.println(evaluation.toSummaryString());

Notes:

1) search.setSearchBackwards(true) enables backward search: the search starts from the full attribute set and removes attributes step by step.

2) 10-fold cross-validation is a common way to estimate a classifier's accuracy. The dataset is split into ten parts; in turn, nine parts are used for training and one for testing, and the mean of the ten results is taken as the accuracy estimate.

3) On the GreedyStepwise object search, you can also call search.setThreshold(double threshold) to set the threshold by which the attribute-selection module discards attributes.

4) Alternatively, use setNumToSelect(int n) to specify how many attributes to select from the ranked list (if a ranking is generated); -1 means all attributes are retained.

5) After running search.search(eval, data), what exactly is the selected subset? At this point the best subset has been found, but the other candidates are not visible. Calling rankedAttributes() completes the traversal of the search space and returns a ranked list of attributes with their merits. Per the Weka documentation: search must have been performed before calling this method; attributes are ranked in the order they are added to the subset during a forward selection search, and each merit value reflects the merit of adding that attribute to the subset. Because of this, merit values may first increase and then decrease as the best subset is "passed by" on the way to the far side of the search space.
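The fold splitting behind note 2 can be sketched without Weka. kFolds below is a hypothetical helper, not Weka's API; Evaluation.crossValidateModel does this partitioning (plus stratification) internally:

```java
import java.util.*;

public class FoldSplitDemo {

    // Splits instance indices 0..n-1 into k shuffled folds of near-equal size.
    static List<List<Integer>> kFolds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));   // fixed seed, reproducible split
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i));
        return folds;
    }

    public static void main(String[] args) {
        List<List<Integer>> folds = kFolds(100, 10, 1L);
        for (int f = 0; f < folds.size(); f++) {
            // fold f serves as the test set once; the other nine folds are the training set
            System.out.println("fold " + f + ": " + folds.get(f).size() + " test instances");
        }
    }
}
```

Each instance lands in exactly one fold, so every instance is tested exactly once across the ten rounds.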

 

I wrote the above last night without actually testing it. I verified it today; it is very slow to run. The code is as follows:

// load the filtered training data from an ARFF file
ArffLoader loader = new ArffLoader();
loader.setFile(new File("dataFiltered.arff"));
Instances data = loader.getDataSet();
data.setClassIndex(0);

AttributeSelection attsel = new AttributeSelection();
CfsSubsetEval eval = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);
attsel.SelectAttributes(data);

// selectedAttributes() returns the indices of the selected attributes
int[] attarray = attsel.selectedAttributes();
System.out.println("the selected attributes are as follows:");
for (int i = 0; i < attarray.length; i++) {
    System.out.println(data.attribute(attarray[i]).name());
}

 

 

 

 

The example given by Weka:

import weka.attributeSelection.*;
import weka.core.*;
import weka.core.converters.ConverterUtils.*;
import weka.classifiers.*;
import weka.classifiers.meta.*;
import weka.classifiers.trees.*;
import weka.filters.*;

import java.util.*;

/**
 * performs attribute selection using CfsSubsetEval and GreedyStepwise
 * (backwards) and trains J48 with that. Needs 3.5.5 or higher to compile.
 *
 * @author FracPete (fracpete at waikato dot ac dot nz)
 */
public class AttributeSelectionTest {

  /**
   * uses the meta-classifier
   */
  protected static void useClassifier(Instances data) throws Exception {
    System.out.println("\n1. Meta-classifier");
    AttributeSelectedClassifier classifier = new AttributeSelectedClassifier();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    J48 base = new J48();
    classifier.setClassifier(base);
    classifier.setEvaluator(eval);
    classifier.setSearch(search);
    Evaluation evaluation = new Evaluation(data);
    evaluation.crossValidateModel(classifier, data, 10, new Random(1));
    System.out.println(evaluation.toSummaryString());
  }

  /**
   * uses the filter
   */
  protected static void useFilter(Instances data) throws Exception {
    System.out.println("\n2. Filter");
    weka.filters.supervised.attribute.AttributeSelection filter = new weka.filters.supervised.attribute.AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    filter.setEvaluator(eval);
    filter.setSearch(search);
    filter.setInputFormat(data);
    Instances newData = Filter.useFilter(data, filter);
    System.out.println(newData);
  }

  /**
   * uses the low level approach
   */
  protected static void useLowLevel(Instances data) throws Exception {
    System.out.println("\n3. Low-level");
    AttributeSelection attsel = new AttributeSelection();
    CfsSubsetEval eval = new CfsSubsetEval();
    GreedyStepwise search = new GreedyStepwise();
    search.setSearchBackwards(true);
    attsel.setEvaluator(eval);
    attsel.setSearch(search);
    attsel.SelectAttributes(data);
    int[] indices = attsel.selectedAttributes();
    System.out.println("selected attribute indices (starting with 0):\n" + Utils.arrayToString(indices));
  }

  /**
   * takes a dataset as first argument
   *
   * @param args        the commandline arguments
   * @throws Exception  if something goes wrong
   */
  public static void main(String[] args) throws Exception {
    // load data
    System.out.println("\n0. Loading data");
    DataSource source = new DataSource(args[0]);
    Instances data = source.getDataSet();
    if (data.classIndex() == -1)
      data.setClassIndex(data.numAttributes() - 1);

    // 1. meta-classifier
    useClassifier(data);

    // 2. filter
    useFilter(data);

    // 3. low-level
    useLowLevel(data);
  }
}

 



https://m.sciencenet.cn/blog-713110-568654.html
