Summary of Learning Bayesian Networks in Weka

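Trying to learn how Weka handles Bayesian networks by reading only the few directly related packages is nearly impossible; you end up missing the forest for the trees. I eventually found that grinding through that code alone gets nowhere and that a more flexible approach is needed. On one hand, I studied the summaries on other people's blogs and gradually sorted out how Weka's classes are organized and what their methods do. On the other hand, I downloaded a Chinese translation of a Weka source-code analysis, which let me understand the important BN (Bayesian network) functions and operations in a broader context.

Pick any class, for example Instance and Instances, or Evaluation: each runs to several hundred or even more than a thousand lines, so reading one from start to finish takes real effort. Many classes use inheritance, many places use interfaces, and abstract classes are rare. The classes are also heavily interconnected: a passing reference here, a single appearance inside some method there, and each of these is a relationship. I had assumed a UML diagram would sort this out, but once I started dragging classes into it, the more I added the more lines appeared, the diagram quickly lost its explanatory value, and in the end it still did not clarify anything.

I started with the theory. The basic principles of naive Bayes and Bayesian networks are not complicated, and a small network can be worked out by hand. It is a bit like matrices: adding and multiplying small matrices by hand is no problem, but large ones call for a computer and some code. Moving the mathematics onto a computer requires some supporting machinery, especially in the implementation. Probability theory itself also needs care; a moment of inattention still leads to all sorts of mistakes.

In the software, the workflow is: read the ARFF data, build the ADTree, train the classifier, and finally test it.

An alternating decision tree (ADTree) is a machine learning method for classification. It generalizes decision trees and has connections to boosting. An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition. Prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by an ADTree by following all paths for which all decision nodes are true and summing any prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and regression tree) or C4.5 in which an instance follows only one path through the tree. (http://en.wikipedia.org/wiki/ADTree)

The example given on that page is worth studying carefully. Note, however, that the ADTree used inside Weka's BayesNet (the getUseADTree() option that appears in the code later) is, as far as I can tell, a different structure that shares the abbreviation: an all-dimensions tree that caches counts from the data so that scores can be computed faster, not an alternating decision tree.

When Weka tests instances there are also many terms that still need study, such as recall, precision, the confusion matrix, and other metrics. The scoring function (score) is grouped under evaluation in the software. The CPT, that is, the conditional probability table, is also key to building the network. One way to put it: Bayes = Bs(DAG) + Bp(CPT).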
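As a small aside on the evaluation terms just mentioned, here is a minimal sketch of how precision and recall can be read off a confusion matrix. The class ConfusionMetrics and its methods are hypothetical helpers written for this note, not Weka's Evaluation API, and the sketch assumes rows are actual classes and columns are predicted classes.

```java
// Hypothetical helper (not part of Weka): per-class precision and recall
// computed from a confusion matrix m, where m[actual][predicted] is a count.
final class ConfusionMetrics {

    // precision for class c = correct predictions of c / all predictions of c
    static double precision(int[][] m, int c) {
        int predicted = 0;
        for (int[] row : m) {
            predicted += row[c]; // column sum: everything predicted as c
        }
        return predicted == 0 ? 0.0 : (double) m[c][c] / predicted;
    }

    // recall for class c = correct predictions of c / all actual instances of c
    static double recall(int[][] m, int c) {
        int actual = 0;
        for (int v : m[c]) {
            actual += v; // row sum: everything that really is c
        }
        return actual == 0 ? 0.0 : (double) m[c][c] / actual;
    }

    public static void main(String[] args) {
        int[][] m = { {8, 2}, {1, 9} }; // made-up counts for illustration
        System.out.println("precision(class 0) = " + precision(m, 0));
        System.out.println("recall(class 0)    = " + recall(m, 0));
    }
}
```

With the made-up matrix above, class 0 has 8 correct predictions out of 9 predicted and out of 10 actual instances.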

Bayes networks in Weka have two requirements: the data must be discretized, and no value may be null. The whole learning process first builds the DAG and then learns the CPTs.

Some notes on the code. In buildClassifier, the important lines are:

    // build the network structure
    initStructure();

    // build the network structure
    buildStructure();

    // build the set of CPTs
    estimateCPTs();

initStructure initializes the network structure, buildStructure constructs it, and estimateCPTs computes the conditional probability tables (CPTs).

    /**
     * Init structure initializes the structure to an empty graph
     * or a Naive Bayes graph (depending on the -N flag).
     */
    public void initStructure() throws Exception {
        // reserve memory
        m_ParentSets = new ParentSet[m_Instances.numAttributes()];
        for (int iAttribute = 0; iAttribute < m_Instances.numAttributes(); iAttribute++) {
            m_ParentSets[iAttribute] = new ParentSet(m_Instances.numAttributes());
        }
    } // initStructure

m_ParentSets[i] records the parents of the i-th attribute (iAttribute). The ParentSet constructor is:

    public ParentSet(int nMaxNrOfParents) {
        m_nParents = new int[nMaxNrOfParents];
        m_nNrOfParents = 0;
        m_nCardinalityOfParents = 1;
    } // ParentSet

It does nothing much, so at this point the network is an empty graph. Next comes buildStructure, which calls buildStructure in SearchAlgorithm:

    /**
     * buildStructure determines the network structure/graph of the network.
     * The default behavior is creating a network where all nodes have
     * the first node as its parent (i.e., a BayesNet that behaves like
     * a naive Bayes classifier). This method can be overridden by derived
     * classes to restrict the class of network structures that are acceptable.
     */
    public void buildStructure(BayesNet bayesNet, Instances instances) throws Exception {
        if (m_bInitAsNaiveBayes) {
            int iClass = instances.classIndex();
            // initialize parent sets to have arrow from classifier node to
            // each of the other nodes
            for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
                if (iAttribute != iClass) {
                    bayesNet.getParentSet(iAttribute).addParent(iClass, instances);
                }
            }
        }
        search(bayesNet, instances);
        if (m_bMarkovBlanketClassifier) {
            doMarkovBlanketCorrection(bayesNet, instances);
        }
    } // buildStructure

This checks whether the structure should be initialized as naive Bayes. If not, the graph stays empty; if so, the class attribute is added as a parent of every other attribute.
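The two starting points described above (empty graph versus naive Bayes graph) can be sketched without Weka as follows. This is only an illustration under my own assumptions: NaiveBayesStructure and init are hypothetical names, and plain lists of parent indices stand in for Weka's ParentSet.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the parent-set bookkeeping (not Weka's ParentSet).
final class NaiveBayesStructure {

    // Returns a list where element i holds the parent indices of attribute i.
    static List<List<Integer>> init(int numAttributes, int classIndex,
                                    boolean initAsNaiveBayes) {
        List<List<Integer>> parents = new ArrayList<>();
        for (int i = 0; i < numAttributes; i++) {
            parents.add(new ArrayList<>()); // empty graph: nobody has parents
        }
        if (initAsNaiveBayes) {
            // arrow from the class node to each of the other nodes
            for (int i = 0; i < numAttributes; i++) {
                if (i != classIndex) {
                    parents.get(i).add(classIndex);
                }
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // four attributes, the last one being the class
        System.out.println(init(4, 3, false));
        System.out.println(init(4, 3, true));
    }
}
```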

The code of addParent is as follows:

    public void addParent(int nParent, Instances _Instances) {
        if (m_nNrOfParents == 10) {
            // reserve more memory
            int[] nParents = new int[50];
            for (int i = 0; i < m_nNrOfParents; i++) {
                nParents[i] = m_nParents[i];
            }
            m_nParents = nParents;
        }
        m_nParents[m_nNrOfParents] = nParent;
        m_nNrOfParents++;
        m_nCardinalityOfParents *= _Instances.attribute(nParent).numValues();
    } // AddParent

The if at the top reserves more memory in advance; the rest records which attribute is the new parent. m_nNrOfParents is the number of parents, and m_nCardinalityOfParents is the product of the numbers of values the parents can take (the code multiplies by numValues() for each added parent), that is, the number of joint parent configurations.

There are many implementations of search; here we look at the K2 implementation:

    int[] nOrder = new int[instances.numAttributes()];
    nOrder[0] = instances.classIndex();
    int nAttribute = 0;
    for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
        if (nAttribute == instances.classIndex()) {
            nAttribute++;
        }
        nOrder[iOrder] = nAttribute++;
    }

In nOrder the class attribute sits at index 0; the remaining attributes keep their original relative order.

    // determine base scores
    double[] fBaseScores = new double[instances.numAttributes()];
    for (int iOrder = 0; iOrder < instances.numAttributes(); iOrder++) {
        int iAttribute = nOrder[iOrder];
        fBaseScores[iAttribute] = calcNodeScore(iAttribute);
    }

To compute the base scores, calcNodeScore is called:

    public double calcNodeScore(int nNode) {
        if (m_BayesNet.getUseADTree() && m_BayesNet.getADTree() != null) {
            return calcNodeScoreADTree(nNode);
        } else {
            return calcNodeScorePlain(nNode);
        }
    }

Leaving the ADTree aside for now, look at calcNodeScorePlain:

    // estimate distributions
    Enumeration enumInsts = instances.enumerateInstances();
    while (enumInsts.hasMoreElements()) {
        Instance instance = (Instance) enumInsts.nextElement();
        // updateClassifier;
        double iCPT = 0;
        for (int iParent = 0; iParent < oParentSet.getNrOfParents(); iParent++) {
            int nParent = oParentSet.getParent(iParent);
            iCPT = iCPT * instances.attribute(nParent).numValues() + instance.value(nParent);
        }
        nCounts[numValues * ((int) iCPT) + (int) instance.value(nNode)]++;
    }

nCounts here is the Nijk mentioned on page 4 of the article Bayesian Network Classifiers in Weka. The indices are flattened into a single array index: iCPT encodes the parent configuration j, and the final term instance.value(nNode) is the node's own value k.

At the end of calcNodeScorePlain, calcScoreOfCount is called:

    for (int iParent = 0; iParent < nCardinality; iParent++) {
        switch (m_nScoreType) {
        case (Scoreable.BAYES): {
            double nSumOfCounts = 0;
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                if (m_fAlpha + nCounts[iParent * numValues + iSymbol] != 0) {
                    fLogScore += Statistics.lnGamma(m_fAlpha + nCounts[iParent * numValues + iSymbol]);
                    nSumOfCounts += m_fAlpha + nCounts[iParent * numValues + iSymbol];
                }
            }
            if (nSumOfCounts != 0) {
                fLogScore -= Statistics.lnGamma(nSumOfCounts);
            }
            if (m_fAlpha != 0) {
                fLogScore -= numValues * Statistics.lnGamma(m_fAlpha);
                fLogScore += Statistics.lnGamma(numValues * m_fAlpha);
            }
        }
        break;

For this case, see the formula for the Bayesian metric on page 6 of Bayesian Network Classifiers in Weka.
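For reference, the Bayesian metric has, as far as I recall from that document, the form shown on the first line below. The second line is my own rewriting of the logarithm of one node's factor, under an assumption read off the code rather than stated in these notes: m_fAlpha plays the role of every N'_{ijk}, so that N'_{ij} = r_i · m_fAlpha with r_i = numValues. The guards against zero arguments in the code are left out.

```latex
% r_i: number of values of node i;  q_i: number of parent configurations of node i
% N_{ijk}: count of node i taking value k under parent configuration j,  N_{ij} = \sum_k N_{ijk}
% N'_{ijk}, N'_{ij}: the corresponding prior counts
Q_{Bayes}(B_S, D) = P(B_S) \prod_{i} \prod_{j=1}^{q_i}
    \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})}
    \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}

% logarithm of the factor for a single node i (what fLogScore accumulates)
\sum_{j=1}^{q_i} \Big[ \sum_{k=1}^{r_i} \ln\Gamma(N'_{ijk} + N_{ijk})
    - \ln\Gamma(N'_{ij} + N_{ij})
    - \sum_{k=1}^{r_i} \ln\Gamma(N'_{ijk})
    + \ln\Gamma(N'_{ij}) \Big]
```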

The first for loop accumulates the terms Gamma(N'ijk + Nijk). Next comes Gamma(N'ij + Nij), since nSumOfCounts is the sum over k of m_fAlpha + Nijk. The final block supplies the remaining factors Gamma(N'ijk) and Gamma(N'ij), which complete the ratios Gamma(N'ijk + Nijk)/Gamma(N'ijk) and Gamma(N'ij)/Gamma(N'ij + Nij). Everything is done in log space through lnGamma.

        case (Scoreable.MDL):
        case (Scoreable.AIC):
        case (Scoreable.ENTROPY): {
            double nSumOfCounts = 0;
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                nSumOfCounts += nCounts[iParent * numValues + iSymbol];
            }
            for (int iSymbol = 0; iSymbol < numValues; iSymbol++) {
                if (nCounts[iParent * numValues + iSymbol] > 0) {
                    fLogScore += nCounts[iParent * numValues + iSymbol]
                        * Math.log(nCounts[iParent * numValues + iSymbol] / nSumOfCounts);
                }
            }
        }
        break;
        }
    }

This corresponds to formula (2) on page 5 of Bayesian Network Classifiers in Weka. The difference is that N does not appear, because it cancels out.

    switch (m_nScoreType) {
    case (Scoreable.MDL): {
        fLogScore -= 0.5 * nCardinality * (numValues - 1) * Math.log(instances.numInstances());
    }
    break;
    case (Scoreable.AIC): {
        fLogScore -= nCardinality * (numValues - 1);
    }
    break;
    }

In the formulas, K = nCardinality * (numValues - 1) and N = instances.numInstances(). See formula (3); for the MDL and AIC computations see formula (5).
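To check my reading of the entropy, MDL and AIC cases, the computation can be reproduced in a few lines without Weka. This is a sketch under stated assumptions: LocalScore, Type and score are names invented here, and the counts array is assumed to use the same flattened layout as nCounts, namely counts[iParent * numValues + iSymbol].

```java
// Hypothetical re-implementation of the entropy-based local scores
// (not Weka code); higher scores are better, as in the excerpt.
final class LocalScore {

    enum Type { ENTROPY, MDL, AIC }

    static double score(int[] counts, int nCardinality, int numValues,
                        int numInstances, Type type) {
        double logScore = 0.0;
        for (int iParent = 0; iParent < nCardinality; iParent++) {
            // N_ij: total count for this parent configuration
            double sum = 0.0;
            for (int k = 0; k < numValues; k++) {
                sum += counts[iParent * numValues + k];
            }
            // sum over k of N_ijk * ln(N_ijk / N_ij), skipping empty cells
            for (int k = 0; k < numValues; k++) {
                int c = counts[iParent * numValues + k];
                if (c > 0) {
                    logScore += c * Math.log(c / sum);
                }
            }
        }
        // K: number of free parameters of this node
        double K = nCardinality * (numValues - 1);
        switch (type) {
            case MDL:
                logScore -= 0.5 * K * Math.log(numInstances);
                break;
            case AIC:
                logScore -= K;
                break;
            default:
                break; // ENTROPY: no penalty term
        }
        return logScore;
    }

    public static void main(String[] args) {
        int[] counts = {2, 2}; // one parent configuration, binary node
        for (Type t : Type.values()) {
            System.out.println(t + ": " + score(counts, 1, 2, 4, t));
        }
    }
}
```

For a binary node with no parents and counts {2, 2}, the entropy term is 4 ln(1/2), and the MDL and AIC variants subtract 0.5 ln 4 and 1 respectively.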

    // K2 algorithm: greedy search restricted by ordering
    for (int iOrder = 1; iOrder < instances.numAttributes(); iOrder++) {
        int iAttribute = nOrder[iOrder];
        double fBestScore = fBaseScores[iAttribute];
        boolean bProgress = (bayesNet.getParentSet(iAttribute).getNrOfParents() < getMaxNrOfParents());
        while (bProgress) {
            int nBestAttribute = -1;
            for (int iOrder2 = 0; iOrder2 < iOrder; iOrder2++) {
                int iAttribute2 = nOrder[iOrder2];
                double fScore = calcScoreWithExtraParent(iAttribute, iAttribute2);
                if (fScore > fBestScore) {
                    fBestScore = fScore;
                    nBestAttribute = iAttribute2;
                }
            }
            if (nBestAttribute != -1) {
                bayesNet.getParentSet(iAttribute).addParent(nBestAttribute, instances);
                fBaseScores[iAttribute] = fBestScore;
                bProgress = (bayesNet.getParentSet(iAttribute).getNrOfParents() < getMaxNrOfParents());
            } else {
                bProgress = false;
            }
        }
    }

bProgress records whether node iAttribute still has fewer parents than the maximum allowed. The logic is roughly as follows. For each attribute (iOrder), its parents are looked for only among positions 0 to iOrder - 1 of the ordering, otherwise a cycle could arise. For each candidate, calcScoreWithExtraParent computes the score with that candidate added; if it is higher than the best score so far, the candidate becomes the best attribute, and addParent adds it.

    public double calcScoreWithExtraParent(int nNode, int nCandidateParent) {
        ParentSet oParentSet = m_BayesNet.getParentSet(nNode);
        // sanity check: nCandidateParent should not be in parent set already
        if (oParentSet.contains(nCandidateParent)) {
            return -1e100;
        }
        // set up candidate parent
        oParentSet.addParent(nCandidateParent, m_BayesNet.m_Instances);
        // calculate the score
        double logScore = calcNodeScore(nNode);
        // delete temporarily added parent
        oParentSet.deleteLastParent(m_BayesNet.m_Instances);
        return logScore;
    } // Ca
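Stripped of the Weka classes, the greedy loop above can be sketched as follows. The names K2Sketch, NodeScore and search are hypothetical, the scoring function is passed in rather than computed from data, and order is assumed to be a permutation of 0..n-1; it illustrates the control flow only and is not Weka's implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Weka-free sketch of K2-style greedy parent selection.
final class K2Sketch {

    /** Hypothetical scoring interface: higher is better. */
    interface NodeScore {
        double score(int node, List<Integer> parents);
    }

    static List<List<Integer>> search(int[] order, int maxParents, NodeScore scorer) {
        int n = order.length;
        List<List<Integer>> parents = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            parents.add(new ArrayList<>());
        }
        for (int iOrder = 1; iOrder < n; iOrder++) {
            int node = order[iOrder];
            List<Integer> ps = parents.get(node);
            double best = scorer.score(node, ps); // base score
            boolean progress = ps.size() < maxParents;
            while (progress) {
                int bestCandidate = -1;
                // candidates come only from earlier positions, so no cycles
                for (int j = 0; j < iOrder; j++) {
                    int cand = order[j];
                    if (ps.contains(cand)) {
                        continue; // already a parent
                    }
                    ps.add(cand);                  // temporarily add candidate
                    double s = scorer.score(node, ps);
                    ps.remove(ps.size() - 1);      // undo (removes by index)
                    if (s > best) {
                        best = s;
                        bestCandidate = cand;
                    }
                }
                if (bestCandidate != -1) {
                    ps.add(bestCandidate);
                    progress = ps.size() < maxParents;
                } else {
                    progress = false; // no candidate improves the score
                }
            }
        }
        return parents;
    }

    public static void main(String[] args) {
        // toy score: node 2 benefits from parent 0; every parent costs 0.1
        NodeScore toy = (node, ps) ->
            (node == 2 && ps.contains(0) ? 1.0 : 0.0) - 0.1 * ps.size();
        System.out.println(search(new int[] {0, 1, 2}, 2, toy));
    }
}
```

With the toy score in main, only node 2 gains a parent (node 0), because any other addition lowers the score.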
