A Summary of Using the Machine Learning Tool WEKA: Algorithm Selection, Attribute Selection, and Parameter Optimization

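I. Attribute selection

1. Background theory

See the following two articles:

数据挖掘中的特征选择算法综述及基于WEKA的性能比较_良龙 (a survey of feature selection algorithms in data mining, with a WEKA-based performance comparison)

数据挖掘中约简技术与属性选择的研究_辉 (a study of reduction techniques and attribute selection in data mining)

2. Attribute selection in WEKA

2.1 Evaluation strategy (attribute evaluator)

Evaluators fall broadly into filter and wrapper methods: in this summary, the former score individual attributes, while the latter score feature subsets.

Wrapper methods include:

CfsSubsetEval

Filter methods include:

CorrelationAttributeEval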

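2.1.1 Wrapper methods:

(1) CfsSubsetEval

Scores an attribute subset by the predictive ability of each feature in it together with the correlation among those features: subsets whose features are individually strong predictors but only weakly correlated with one another score well. Strictly speaking, CfsSubsetEval is a correlation-based subset evaluator and does not run a learning algorithm; it is grouped here because it scores subsets rather than single attributes.

Evaluates the worth of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy between them. Subsets of features that are highly correlated with the class while having low intercorrelation are preferred.

For more information see:

M. A. Hall (1998). Correlation-based Feature Subset Selection for Machine Learning. Hamilton, New Zealand.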
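To make the trade-off concrete, here is a minimal sketch of the merit heuristic from Hall's correlation-based feature selection work, Merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff), where k is the subset size, r_cf the mean feature-class correlation and r_ff the mean feature-feature correlation. The function name and the correlation values are illustrative assumptions; this is not WEKA's implementation.

```python
import math


def cfs_merit(class_correlations, pair_correlations):
    """Merit of a subset, given its feature-class and feature-feature correlations."""
    k = len(class_correlations)
    r_cf = sum(class_correlations) / k
    # A single feature has no pairs, so its redundancy term is zero.
    r_ff = sum(pair_correlations) / len(pair_correlations) if pair_correlations else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical numbers: two equally predictive features,
# once weakly and once strongly correlated with each other.
low_redundancy = cfs_merit([0.6, 0.6], [0.1])
high_redundancy = cfs_merit([0.6, 0.6], [0.9])
print(round(low_redundancy, 3), round(high_redundancy, 3))
```

With identical predictive ability, the less redundant pair receives the higher merit, which is the preference the description above states.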

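(2) WrapperSubsetEval

A wrapper embeds the downstream learning algorithm in the feature selection process: a feature subset is judged by the predictive performance that algorithm achieves with it, with little regard for the predictive power of each individual feature.

Consequently, the features in the optimal subset need not each be individually optimal.

Evaluates attribute sets by using a learning scheme. Cross validation is used to estimate the accuracy of the learning scheme for a set of attributes.

For more information see:

Ron Kohavi, George H. John (1997). Wrappers for feature subset selection. Artificial Intelligence. 97(1-2):273-324.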
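The wrapper idea can be sketched in a few lines: score a subset by the cross-validated accuracy of a learner that sees only those attributes. In this sketch a 1-nearest-neighbour rule stands in for "the learning scheme", folds are assigned by index stride, and the tiny dataset is made up; all names are hypothetical and none of this is WEKA code.

```python
def nearest_neighbour_label(train, query, subset):
    """Label of the training row closest to `query`, using only attributes in `subset`."""
    def distance(row):
        features, _ = row
        return sum((features[i] - query[i]) ** 2 for i in subset)
    return min(train, key=distance)[1]


def wrapper_score(data, subset, folds=3):
    """Cross-validated accuracy of 1-NN restricted to the attributes in `subset`."""
    correct = 0
    for fold in range(folds):
        test = [row for i, row in enumerate(data) if i % folds == fold]
        train = [row for i, row in enumerate(data) if i % folds != fold]
        correct += sum(nearest_neighbour_label(train, x, subset) == y for x, y in test)
    return correct / len(data)


# Made-up data: attribute 0 separates the classes, attribute 1 is noise.
data = [((0.0, 5.0), "a"), ((1.0, 0.0), "b"), ((0.1, 0.2), "a"),
        ((0.9, 4.8), "b"), ((0.2, 2.5), "a"), ((1.1, 2.4), "b")]

for subset in ([0], [1], [0, 1]):
    print(subset, wrapper_score(data, subset))
```

The score depends on the subset as a whole and on the chosen learner, which is why a wrapper can prefer a subset whose members are not individually the best.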

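2.1.2 Filter methods:

When one of these evaluators is chosen, the search method must be Ranker.

(1) CorrelationAttributeEval

Selects attributes by the correlation between each single attribute and the class.

Evaluates the worth of an attribute by measuring the correlation (Pearson's) between it and the class.

Nominal attributes are considered on a value by value basis by treating each value as an indicator. An overall correlation for a nominal attribute is arrived at via a weighted average.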
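As a rough illustration of the numeric case only, the sketch below computes Pearson's correlation between each attribute and a class coded as 0/1. The data and names are invented, and the value-by-value handling of nominal attributes is not modeled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical data: a binary class and two numeric attributes.
class_labels = [0, 0, 0, 1, 1, 1]
attributes = {
    "informative": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],
    "noisy": [2.0, 0.5, 3.0, 1.0, 2.5, 1.5],
}
correlations = {name: pearson(values, class_labels) for name, values in attributes.items()}
print(correlations)
```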

(2) GainRatioAttributeEval

Selects attributes by gain ratio.

Evaluates the worth of an attribute by measuring the gain ratio with respect to the class.

GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).

(3) InfoGainAttributeEval

Selects attributes by information gain.

Evaluates the worth of an attribute by measuring the information gain with respect to the class.

InfoGain(Class, Attribute) = H(Class) - H(Class | Attribute).

(4) OneRAttributeEval

Evaluates attributes using the OneR classifier.

Class for building and using a 1R classifier; in other words, uses the minimum-error attribute for prediction, discretizing numeric attributes. For more information, see:

R. C. Holte (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning. 11:63-91.

(5) PrincipalComponents

Principal components analysis (PCA).

Performs a principal components analysis and transformation of the data. Use in conjunction with a Ranker search. Dimensionality reduction is accomplished by choosing enough eigenvectors to account for some percentage of the variance in the original data (default 0.95, i.e. 95%). Attribute noise can be filtered by transforming to the PC space, eliminating some of the worst eigenvectors, and then transforming back to the original space.

(6) ReliefFAttributeEval

Evaluates attributes by their ReliefF score.

Evaluates the worth of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and different class. Can operate on both discrete and continuous class data.

For more information see:

Kenji Kira, Larry A. Rendell: A Practical Approach to Feature Selection. In: Ninth International Workshop on Machine Learning, 249-256, 1992.

Igor Kononenko: Estimating Attributes: Analysis and Extensions of RELIEF. In: European Conference on Machine Learning, 171-182, 1994.

Marko Robnik-Sikonja, Igor Kononenko: An adaptation of Relief for attribute estimation in regression. In: Fourteenth International Conference on Machine Learning, 296-304, 1997.

(7) SymmetricalUncertAttributeEval

Evaluates attributes by their symmetrical uncertainty.

Evaluates the worth of an attribute by measuring the symmetrical uncertainty with respect to the class.

SymmU(Class, Attribute) = 2 * (H(Class) - H(Class | Attribute)) / (H(Class) + H(Attribute)).
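The three entropy-based measures in this list (gain ratio, information gain, symmetrical uncertainty) can be computed directly from their definitions. The sketch below assumes nominal attributes, entropies in bits, and symmetrical uncertainty with the sum H(Class) + H(Attribute) in the denominator. The toy data follows the counts of the familiar 14-instance weather example and is used purely for illustration.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of nominal values."""
    n = len(values)
    return -sum((count / n) * math.log2(count / n) for count in Counter(values).values())


def conditional_entropy(classes, attribute):
    """H(Class | Attribute): class entropy within each attribute value, weighted by frequency."""
    groups = {}
    for value, label in zip(attribute, classes):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / len(classes) * entropy(group) for group in groups.values())


def info_gain(classes, attribute):
    return entropy(classes) - conditional_entropy(classes, attribute)


def gain_ratio(classes, attribute):
    h_attribute = entropy(attribute)
    return info_gain(classes, attribute) / h_attribute if h_attribute > 0 else 0.0


def symmetrical_uncertainty(classes, attribute):
    denominator = entropy(classes) + entropy(attribute)
    return 2 * info_gain(classes, attribute) / denominator if denominator > 0 else 0.0


# Toy data: one nominal attribute with three values and a yes/no class.
outlook = ["sunny"] * 5 + ["overcast"] * 4 + ["rainy"] * 5
play = ["yes", "yes", "no", "no", "no"] + ["yes"] * 4 + ["yes", "yes", "yes", "no", "no"]

print(round(info_gain(play, outlook), 3))
print(round(gain_ratio(play, outlook), 3))
print(round(symmetrical_uncertainty(play, outlook), 3))
```

Gain ratio and symmetrical uncertainty both normalize information gain, which offsets its tendency to favor attributes with many distinct values.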

2.2 Search strategy (search method)

2.2.1 Search methods paired with the wrapper evaluators

(1) BestFirst

Best-first search, a greedy search strategy.

Searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility. Setting the number of consecutive non-improving nodes allowed controls the level of backtracking done. Best first may start with the empty set of attributes and search forward, or start with the full set of attributes and search backward, or start at any point and search in both directions (by considering all possible single attribute additions and deletions at a given point).

(2) ExhaustiveSearch

Exhaustively searches all possible attribute subsets.

Performs an exhaustive search through the space of attribute subsets starting from the empty set of attributes. Reports the best subset found.

(3) GeneticSearch

A search based on the simple genetic algorithm that Goldberg described in 1989.

Performs a search using the simple genetic algorithm described in Goldberg (1989).

For more information see:

David E. Goldberg (1989). Genetic algorithms in search, optimization and machine learning. Addison-Wesley.

(4) GreedyStepwise

A forward or backward search that moves one attribute at a time.

Performs a greedy forward or backward search through the space of attribute subsets. May start with no/all attributes or from an arbitrary point in the space. Stops when the addition/deletion of any remaining attributes results in a decrease in evaluation. Can also produce a ranked list of attributes by traversing the space from one side to the other and recording the order that attributes are selected.
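A minimal sketch of the forward variant follows. It starts from the empty set and, unlike the description above, stops as soon as no single addition improves the score (a slightly stricter stopping rule chosen for brevity). The scoring function is a made-up stand-in for a subset evaluator, and all names are hypothetical.

```python
def greedy_forward(n_attributes, evaluate):
    """Add the best single attribute repeatedly until no addition improves the score."""
    selected = []
    best_score = evaluate(selected)
    while len(selected) < n_attributes:
        candidates = [a for a in range(n_attributes) if a not in selected]
        score, attribute = max((evaluate(selected + [a]), a) for a in candidates)
        if score <= best_score:
            break  # no remaining attribute helps
        selected.append(attribute)
        best_score = score
    return selected, best_score


# Toy evaluator: each attribute contributes a fixed usefulness, each one costs 0.1.
USEFULNESS = [0.5, 0.3, 0.05, 0.0]


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.1 * len(subset)


selected, score = greedy_forward(len(USEFULNESS), toy_score)
print(selected, round(score, 2))
```

Each pass evaluates at most one subset per remaining attribute, so the number of evaluations grows quadratically with the number of attributes rather than exponentially.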

(5) RandomSearch

Random search.

Performs a random search in the space of attribute subsets. If no start set is supplied, random search starts from a random point and reports the best subset found. If a start set is supplied, random search looks randomly for subsets that are as good as or better than the start point with the same or fewer attributes. Using RandomSearch in conjunction with a start set containing all attributes equates to the LVF algorithm of Liu and Setiono (ICML-96).

For more information see:

H. Liu, R. Setiono: A probabilistic approach to feature selection - A filter solution. In: 13th International Conference on Machine Learning, 319-327, 1996.

(6) RankSearch

Uses an evaluator to score the attributes and rank them.

Uses an attribute/subset evaluator to rank all attributes. If a subset evaluator is specified, then a forward selection search is used to generate a ranked list. From the ranked list of attributes, subsets of increasing size are evaluated, i.e. the best attribute, the best attribute plus the next best attribute, etc. The best attribute set is reported. RankSearch is linear in the number of attributes if a simple attribute evaluator is used such as GainRatioAttributeEval. For more information see:

Mark Hall, Geoffrey Holmes (2003). Benchmarking attribute selection techniques for discrete class data mining. IEEE Transactions on Knowledge and Data Engineering. 15(6):1437-1447.

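2.2.2 Search method paired with the filter evaluators

(1) Ranker:

Ranks attributes by their evaluation scores; used together with the filter evaluators above.

Ranks attributes by their individual evaluations. Use in conjunction with attribute evaluators (ReliefF, GainRatio, Entropy etc).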
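Ranking by individual scores amounts to a sort, optionally followed by a cut-off. In the sketch below the `threshold` and `num_to_select` parameters are loosely modeled on Ranker's options of the same purpose; the function itself and the scores are illustrative assumptions.

```python
def rank_attributes(scores, threshold=float("-inf"), num_to_select=-1):
    """Sort attributes by score (best first), drop those below `threshold`,
    and keep at most `num_to_select` of them (-1 keeps all)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranked = [(name, score) for name, score in ranked if score >= threshold]
    if num_to_select >= 0:
        ranked = ranked[:num_to_select]
    return ranked


# Illustrative per-attribute scores, e.g. produced by an information-gain evaluator.
scores = {"temperature": 0.029, "outlook": 0.247, "windy": 0.048, "humidity": 0.152}

for name, score in rank_attributes(scores, num_to_select=2):
    print(name, score)
```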

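3. My summary

For a given algorithm and its parameter settings, combining the WrapperSubsetEval evaluator with the ExhaustiveSearch search method is guaranteed to find the attribute subset that scores best for that algorithm and those settings.

However, it takes a long time to compute, and the cost grows exponentially with the number of attributes, since n attributes have 2^n possible subsets.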
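The exponential cost is easy to see in a sketch that enumerates every non-empty subset. Here a toy scoring function stands in for the cross-validated wrapper score; the names and numbers are hypothetical.

```python
from itertools import combinations


def exhaustive_search(n_attributes, evaluate):
    """Evaluate every non-empty attribute subset and return the best one found."""
    best_subset, best_score, evaluated = (), float("-inf"), 0
    for size in range(1, n_attributes + 1):
        for subset in combinations(range(n_attributes), size):
            evaluated += 1
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, evaluated


# Toy evaluator: each attribute contributes a fixed usefulness, each one costs 0.1.
USEFULNESS = [0.5, 0.3, 0.05, 0.0]


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.1 * len(subset)


best_subset, best_score, evaluated = exhaustive_search(len(USEFULNESS), toy_score)
print(best_subset, round(best_score, 2), evaluated)

# The number of non-empty subsets doubles (plus one) with every extra attribute.
for n in (10, 20, 30):
    print(n, 2 ** n - 1)
```

When each evaluation is itself a cross-validation run of the learner, even a few dozen attributes put the exhaustive combination out of reach.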

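II. Parameter optimization

For a particular algorithm, parameters can be optimized in the following three ways:

CVParameterSelection, GridSearch, MultiSearch.

1. CVParameterSelection

Selects parameter values by cross-validation.

Advantage:

Any number of parameters can be optimized.

Disadvantages:

① With many parameters, the number of parameter combinations can explode. ② Only the direct parameters of the classifier can be optimized, not those of a nested scheme: for example, the parameter C of weka.classifiers.functions.SMO can be optimized, but not the parameter C of a weka.classifiers.functions.SMO nested inside weka.classifiers.meta.FilteredClassifier.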

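Example:

Optimizing the confidence factor C of J48

1 Load the dataset;

2 Choose weka.classifiers.meta.CVParameterSelection as the classifier;

3 Choose weka.classifiers.trees.J48 as the base classifier of the classifier from step 2;

4 Enter the parameter optimization string:

C 0.1 0.5 5 (optimize parameter C over the range 0.1 to 0.5 in 5 steps, i.e. with an increment of (0.5 - 0.1)/(5 - 1) = 0.1)

5 Run it. In the resulting output (shown as a figure in the original), the last line gives the optimized parameter.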
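To see which candidate values a string such as "C 0.1 0.5 5" describes, the sketch below expands it under the assumption that the last number counts the candidate values, both endpoints included. The helper is hypothetical and only mimics the format of the string; it does not call WEKA.

```python
def cv_parameter_values(spec):
    """Expand 'NAME LOWER UPPER STEPS' into the parameter name and its candidate values."""
    name, lower, upper, steps = spec.split()
    lower, upper, steps = float(lower), float(upper), int(steps)
    if steps == 1:
        return name, [lower]
    increment = (upper - lower) / (steps - 1)
    # Rounding only tidies floating-point noise such as 0.30000000000000004.
    return name, [round(lower + i * increment, 10) for i in range(steps)]


name, values = cv_parameter_values("C 0.1 0.5 5")
print(name, values)
```

Each candidate value is then scored by cross-validation, so adding a second parameter with 5 steps would raise the number of cross-validation runs from 5 to 25.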

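2. GridSearch

Selects parameters with a grid search rather than by trying every parameter combination.

Advantages:

① In theory, for the same optimization range and settings, GridSearch should be faster than CVParameterSelection. ② It is not limited to the direct parameters of the classifier; parameters of a nested scheme can be optimized too. ③ One of the two optimized parameters may belong to the filter, which is why the property expression needs the prefix classifier. or filter. ④ It supports automatic extension of the range.

Disadvantage:

At most 2 parameters can be optimized.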

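Example:

Optimizing the parameters of SMO with an RBFKernel

1 Load the dataset;

2 Choose GridSearch as the classifier;

3 Set GridSearch's classifier to weka.classifiers.functions.SMO, with weka.classifiers.functions.supportVector.RBFKernel as its kernel;

4 Set the X parameter.

XProperty: classifier.c, XMin: 1, XMax: 16, XStep: 1, XExpression: I.

This means: parameter c is selected, with a range of 1 to 16 and a step of 1.

5 Set the Y parameter.

YProperty: classifier.kernel.gamma, YMin: -5, YMax: 2, YStep: 1, YBase: 10, YExpression: pow(BASE,I).

This means: parameter kernel.gamma is selected, with the values 10^-5, 10^-4, …, 10^2.

6 In the resulting output (shown as a figure in the original), the last line gives the optimized parameters.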
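The X and Y settings in this example describe a grid of candidate (c, gamma) pairs, which the following sketch expands. The helper name is hypothetical, the two lambdas mirror the expressions I and pow(BASE,I), and how GridSearch actually traverses or extends the grid is not modeled.

```python
def grid_axis(minimum, maximum, step, expression, base=10):
    """Values of one grid axis: apply `expression(base, i)` for i = minimum, minimum + step, ..."""
    values = []
    i = minimum
    while i <= maximum:
        values.append(expression(base, i))
        i += step
    return values


# X axis: expression I, so the parameter value is the index itself (1..16).
c_values = grid_axis(1, 16, 1, lambda base, i: i)
# Y axis: expression pow(BASE,I) with base 10, so gamma runs over 10^-5 .. 10^2.
gamma_values = grid_axis(-5, 2, 1, lambda base, i: base ** i)

grid = [(c, gamma) for c in c_values for gamma in gamma_values]
print(len(c_values), len(gamma_values), len(grid))
```

Using an exponent as the index keeps the step uniform on a logarithmic scale, a common choice for parameters such as the RBF gamma.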

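3. MultiSearch

Similar to grid search, but more general and simpler.

Advantages:

① It is not limited to the direct parameters of the classifier; parameters of a nested scheme or of a filter can be optimized too. ② Any number of parameters can be optimized.

Disadvantage:

Automatic extension of the search boundaries is not supported.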

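4. My summary

① If no more than 2 parameters need optimizing, use GridSearch and enable automatic boundary extension;

② If more than 2 parameters need optimizing, use MultiSearch;

③ If only direct parameters of the classifier are optimized and there are no more than 2 of them, CVParameterSelection is also worth considering.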

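III. WEKA's meta algorithms

1. Algorithms and descriptions

LocalWeightedLearning: locally weighted learning;

AdaBoostM1: the AdaBoost method;

AdditiveRegression: GBRT (Gradient Boosting Regression Tree). It is a boosting algorithm that trains several models in a cascade: each later model concentrates on the residual between the actual values and the combined predictions of all earlier models and is trained on that residual, and the final prediction adds up the outputs of the cascade.

AttributeSelectedClassifier: combines attribute selection and a classifier in one configuration; attribute selection runs first, then classification or regression;

Bagging: the bagging method;

ClassificationViaRegression: performs classification using regression methods;

LogitBoost: a boosting algorithm that performs classification by means of regression.

MultiClassClassifier: handles multi-class problems using two-class classifiers.

RandomCommittee: averages the results of randomized base classifiers.

RandomSubSpace;

FilteredClassifier: combines a filter and a classifier in one configuration; filtering runs first, then classification or regression (not in Auto-WEKA);

MultiScheme: selects the best among several specified classifiers or parameter configurations (much like the Experimenter; not in Auto-WEKA).

RandomizableFilteredClassifier: a variant of FilteredClassifier that is useful for ensemble classifiers such as RandomCommittee. Both the filter and the classifier must support the Randomizable interface (not in Auto-WEKA).

Vote;

Stacking.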
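The residual-fitting idea described for AdditiveRegression in this list can be sketched as follows. The sketch assumes squared-error regression on a single numeric input, uses a one-split regression stump as the base learner, and starts from the mean of the targets; these choices, the names, and the data are illustrative and are not taken from WEKA.

```python
def fit_stump(xs, targets):
    """Best single-threshold regression stump for `targets`, by squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(targets) / len(targets)
        return lambda x: mean
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def additive_regression(xs, ys, rounds=10, shrinkage=1.0):
    """Fit each new stump to the residuals left by the models before it."""
    base = sum(ys) / len(ys)
    residuals = [y - base for y in ys]
    models = []
    for _ in range(rounds):
        stump = fit_stump(xs, residuals)
        models.append(stump)
        residuals = [r - shrinkage * stump(x) for x, r in zip(xs, residuals)]
    # The final prediction adds up the (shrunken) outputs of all models.
    return lambda x: base + shrinkage * sum(model(x) for model in models)


xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 3.1, 2.9, 5.0, 5.2]
model = additive_regression(xs, ys, rounds=10)
print([round(model(x), 2) for x in xs])
```

A shrinkage value below 1 makes each model correct only part of the remaining residual, which typically needs more rounds but smooths the fit.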

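2. My summary

The meta package provides many methods that take base classifiers as input. Among them:

① AdaBoostM1 and Bagging are the commonly used meta methods;

② MultiScheme works much like the Experimenter;

③ AttributeSelectedClassifier combines attribute selection and a classifier in one configuration, which is convenient.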

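IV. Auto-WEKA

Auto-WEKA supports automatic selection of attributes, algorithms, and parameters.

1. Attribute selection

Attribute selection is a data preprocessing step that runs before classification or regression.

The evaluators and search methods available for attribute selection in Auto-WEKA are shown in the figure above.

Those marked with * are search methods; the rest are evaluators.

As can be seen, the complete search that combines the WrapperSubsetEval evaluator with the ExhaustiveSearch search method is not included.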

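2. Algorithm selection

The figure above shows the classification and regression algorithms included in Auto-WEKA, 39 in total: 27 base classifiers, 10 meta classifiers, and 2 ensemble classifiers.

A meta classifier can take any one base classifier as input, and an ensemble classifier can take up to 5 base classifiers as input.

The 27 base classifiers include:

3 from Bayes: BayesNet, NaiveBayes, and NaiveBayesMultinomial;

9 from Functions: GaussianProcesses, LinearRegression, LogisticRegression, SingleLayerPerceptron, SGD, SVM, SimpleLinearRegression, SimpleLogisticRegression, VotedPerceptron.

Note that the commonly used MultilayerPerceptron, RBFClassifier, and RBFNetwork are absent.

2 from Lazy: KNN, KStar (*).

6 from Rules: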

Dec
