
Data Mining Review Questions and Answers

1. Consider the training set for a binary classification problem shown below (Table 4.8, the data set for Exercise 3).

Instance  a1  a2  a3   Target class
1         T   T   1.0  +
2         T   T   6.0  +
3         T   F   5.0  -
4         F   F   4.0  +
5         F   T   7.0  -
6         F   T   3.0  -
7         F   F   8.0  -
8         T   F   7.0  +
9         F   T   5.0  -

(1) What is the entropy of the whole training set with respect to the class attribute?
(2) What are the information gains of a1 and a2 on this training set?
(3) For the continuous attribute a3, compute the information gain of every possible split.
(4) According to information gain, which of a1, a2, a3 is the best split?
(5) * (skipped)
(6) According to classification error rate, which of a1 and a2 is best?
(7) According to the Gini index, which of a1 and a2 is best?

Answer 1: Entropy(t) = -Σ_j p(j|t) log2 p(j|t). Here P(+) = 4/9 and P(-) = 5/9, so

Entropy = -(4/9) log2(4/9) - (5/9) log2(5/9) = 0.9911.

Answer 2: Information gain is defined as

GAIN_split = Entropy(p) - Σ_{i=1}^{k} (n_i/n) Entropy(i),

where the parent node p is split into k partitions and n_i is the number of records in partition i. It measures the reduction in entropy achieved by the split; the split achieving the largest reduction (maximum GAIN) is chosen. It is used in ID3 and C4.5. Disadvantage: it tends to prefer splits that result in a large number of partitions, each small but pure. (This background is probably not examinable.)

For attribute a1 the corresponding counts are:

a1   +   -
T    3   1
F    1   4

Entropy(a1=T) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
Entropy(a1=F) = -(1/5) log2(1/5) - (4/5) log2(4/5) = 0.7219
Weighted entropy for a1 = (4/9)(0.8113) + (5/9)(0.7219) = 0.7616,
so the information gain for a1 is 0.9911 - 0.7616 = 0.2294.

For attribute a2 the corresponding counts are:

a2   +   -
T    2   3
F    2   2

Weighted entropy for a2 = (5/9)[-(2/5) log2(2/5) - (3/5) log2(3/5)] + (4/9)[-(2/4) log2(2/4) - (2/4) log2(2/4)] = 0.9839,
so the information gain for a2 is 0.9911 - 0.9839 = 0.0072.
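As a check on Answers 1 and 2, the entropy and information-gain computations can be reproduced with a short script (a minimal sketch; the data set is hard-coded from the table above, and `entropy`/`info_gain` are illustrative helper names, not from the textbook):

```python
from math import log2

# Table 4.8 training set: (a1, a2, a3, class)
data = [
    ('T', 'T', 1.0, '+'), ('T', 'T', 6.0, '+'), ('T', 'F', 5.0, '-'),
    ('F', 'F', 4.0, '+'), ('F', 'T', 7.0, '-'), ('F', 'T', 3.0, '-'),
    ('F', 'F', 8.0, '-'), ('T', 'F', 7.0, '+'), ('F', 'T', 5.0, '-'),
]

def entropy(labels):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t) over the class distribution."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_gain(records, attr_index):
    """Gain = Entropy(parent) - weighted entropy of the partitions."""
    parent = entropy([r[-1] for r in records])
    n = len(records)
    weighted = 0.0
    for v in set(r[attr_index] for r in records):
        part = [r[-1] for r in records if r[attr_index] == v]
        weighted += len(part) / n * entropy(part)
    return parent - weighted

print(round(entropy([r[-1] for r in data]), 4))  # 0.9911
print(round(info_gain(data, 0), 4))              # a1: 0.2294
print(round(info_gain(data, 1), 4))              # a2: 0.0072
```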

Answer 3: For a continuous attribute the splits can be evaluated efficiently: sort the records on the attribute values, then linearly scan the values, each time updating the count matrix and computing the impurity; choose the split position with the lowest impurity (highest gain). For a3 the candidate split points, entropies, and information gains are:

a3 (class)        split point   entropy   info gain
1.0 (+)           2.0           0.8484    0.1427
3.0 (-)           3.5           0.9885    0.0026
4.0 (+)           4.5           0.9183    0.0728
5.0 (-), 5.0 (-)  5.5           0.9839    0.0072
6.0 (+)           6.5           0.9728    0.0183
7.0 (-), 7.0 (+)  7.5           0.8889    0.1022
8.0 (-)

The best split for a3 is at 2.0, with gain 0.1427.

Answer 4: According to information gain, a1 produces the best split (0.2294, versus 0.0072 for a2 and 0.1427 for the best a3 split).

Answer 6 (classification error rate): Error(t) = 1 - max_i P(i|t). For a1: error = (4/9)(1 - 3/4) + (5/9)(1 - 4/5) = 2/9 ≈ 0.2222. For a2: error = (5/9)(1 - 3/5) + (4/9)(1 - 2/4) = 4/9 ≈ 0.4444. So a1 is also the better split according to error rate.
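The split-point scan for a3 in Answer 3 can be sketched as follows (a minimal illustration: candidate splits are taken as midpoints between consecutive distinct values, and the data are hard-coded from the table above):

```python
from math import log2

# a3 values with class labels, taken from the Table 4.8 training set
a3 = [(1.0, '+'), (6.0, '+'), (5.0, '-'), (4.0, '+'), (7.0, '-'),
      (3.0, '-'), (8.0, '-'), (7.0, '+'), (5.0, '-')]

def entropy(labels):
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

parent = entropy([c for _, c in a3])
vals = sorted(set(v for v, _ in a3))

# candidate split points: midpoints between consecutive distinct values
gains = {}
for split in [(x + y) / 2 for x, y in zip(vals, vals[1:])]:
    left = [c for v, c in a3 if v <= split]
    right = [c for v, c in a3 if v > split]
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(a3)
    gains[split] = round(parent - weighted, 4)

print(gains)
# the best split for a3 is at 2.0, with gain 0.1427
print(max(gains, key=gains.get))
```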

Answer 7 (Gini index): GINI(t) = 1 - Σ_j [p(j|t)]^2, and a binary split is scored by the weighted Gini of its two partitions, so larger and purer partitions are favored. Slide example (from Tan, Steinbach & Kumar, Introduction to Data Mining): a parent node with class counts (6, 6) has Gini = 0.500; splitting it into N1 with counts (5, 2) and N2 with counts (1, 4) gives Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408, Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.32, and Gini(children) = (7/12)(0.408) + (5/12)(0.32) = 0.371.

For attribute a1 the Gini index is
(4/9)[1 - (3/4)^2 - (1/4)^2] + (5/9)[1 - (1/5)^2 - (4/5)^2] = 0.3444.
For attribute a2 the Gini index is
(5/9)[1 - (2/5)^2 - (3/5)^2] + (4/9)[1 - (2/4)^2 - (2/4)^2] = 0.4889.
Since the Gini index for a1 is smaller, it produces the better split.

II. Consider the following data set for a binary classification problem (a comparison of the impurity measures).

A  B  Class label
T  F  +
T  T  +
T  T  +
T  F  -
T  T  +
F  F  -
F  F  -
F  F  -
T  T  -
T  F  -

1. Compute the information gain. Which attribute would the decision tree induction algorithm choose?

The contingency tables after splitting on attributes A and B are:

      A=T  A=F         B=T  B=F
+      4    0     +     3    1
-      3    3     -     1    5

The overall entropy before splitting is
E_orig = -0.4 log2 0.4 - 0.6 log2 0.6 = 0.9710.

The information gain after splitting on A:
E(A=T) = -(4/7) log2(4/7) - (3/7) log2(3/7) = 0.9852
E(A=F) = 0 (all three records are negative)
Gain = E_orig - (7/10) E(A=T) - (3/10) E(A=F) = 0.9710 - 0.6897 = 0.2813.

The information gain after splitting on B:
E(B=T) = -(3/4) log2(3/4) - (1/4) log2(1/4) = 0.8113
E(B=F) = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.6500
Gain = E_orig - (4/10) E(B=T) - (6/10) E(B=F) = 0.9710 - 0.3245 - 0.3900 = 0.2565.

Therefore attribute A will be chosen to split the node.

2. Compute the Gini gain. Which attribute would decision tree induction choose?

The overall Gini before splitting is
G_orig = 1 - 0.4^2 - 0.6^2 = 0.48.

The gain in Gini after splitting on A:
G(A=T) = 1 - (4/7)^2 - (3/7)^2 = 0.4898, G(A=F) = 0
Gain = G_orig - (7/10) G(A=T) - (3/10) G(A=F) = 0.1371.

The gain in Gini after splitting on B:
G(B=T) = 1 - (3/4)^2 - (1/4)^2 = 0.3750
G(B=F) = 1 - (1/6)^2 - (5/6)^2 = 0.2778
Gain = G_orig - (4/10) G(B=T) - (6/10) G(B=F) = 0.1633.

Therefore attribute B will be chosen to split the node. (This answer is correct.)
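The two Gini gains above can be reproduced with a quick sketch (the data set is hard-coded from the table above; `gini` and `gini_gain` are illustrative helper names):

```python
# (A, B, class) records for the second problem
records = [('T','F','+'), ('T','T','+'), ('T','T','+'), ('T','F','-'),
           ('T','T','+'), ('F','F','-'), ('F','F','-'), ('F','F','-'),
           ('T','T','-'), ('T','F','-')]

def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2 over the class distribution."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def gini_gain(recs, idx):
    """Overall Gini minus the weighted Gini of the partitions."""
    n = len(recs)
    weighted = 0.0
    for v in set(r[idx] for r in recs):
        part = [r[-1] for r in recs if r[idx] == v]
        weighted += len(part) / n * gini(part)
    return gini([r[-1] for r in recs]) - weighted

print(round(gini_gain(records, 0), 4))  # attribute A: 0.1371
print(round(gini_gain(records, 1), 4))  # attribute B: 0.1633
```

Note that information gain picks A while Gini gain picks B, which sets up question 3 below.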

3. From Figure 4.13 it can be seen that entropy and the Gini index are both monotonically increasing on the range [0, 0.5] and monotonically decreasing on [0.5, 1]. Is it possible for information gain and the Gini gain to favor different attributes? Explain your reasoning.

Answer: Yes. Even though these measures have a similar range and monotonic behavior, their respective gains, which are scaled differences of the measures, do not necessarily behave in the same way, as illustrated by the results in parts (a) and (b).

Bayesian classification

Example of a naive Bayes classifier (from the lecture slides): given a test record X = (Refund = No, Marital Status = Married, Income = 120K), classify X by comparing P(class) × Π_i P(attribute value i | class) across the classes.

7. Consider the data set shown in Table 5.10:

Instance  A  B  C  Class
1         0  0  0  +
2         0  0  1  -
3         0  1  1  -
4         0  1  1  -
5         0  0  1  +
6         1  0  1  +
7         1  0  1  -
8         1  0  1  -
9         1  1  1  +
10        1  0  1  +

(a) Estimate the conditional probabilities P(A|+), P(B|+), P(C|+), P(A|-), P(B|-), and P(C|-).
(b) Using the conditional probabilities from (a), predict the class label of the test record (A=0, B=1, C=0) with the naive Bayes approach.
(c) Estimate the conditional probabilities using the m-estimate (with p = 1/2 and m = 4).
(d) Repeat (b) using the conditional probabilities from (c).
(e) Compare the two methods of estimating the probabilities. Which is better, and why?

Answer (a):
P(A=1|-) = 2/5, P(B=1|-) = 2/5, P(C=1|-) = 1,
P(A=0|-) = 3/5, P(B=0|-) = 3/5, P(C=0|-) = 0;
P(A=1|+) = 3/5, P(B=1|+) = 1/5, P(C=1|+) = 4/5,
P(A=0|+) = 2/5, P(B=0|+) = 4/5, P(C=0|+) = 1/5.
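The estimates in part (a), and the m-estimates in part (c), can be reproduced with a small helper (a sketch; the table is hard-coded from above, and `cond_prob` is a hypothetical helper name covering both estimators via the `m` parameter):

```python
# Table 5.10 records: (A, B, C, class)
rows = [(0,0,0,'+'), (0,0,1,'-'), (0,1,1,'-'), (0,1,1,'-'), (0,0,1,'+'),
        (1,0,1,'+'), (1,0,1,'-'), (1,0,1,'-'), (1,1,1,'+'), (1,0,1,'+')]

def cond_prob(attr_idx, value, cls, m=0, p=0.5):
    """P(attr = value | class) = (n_c + m*p) / (n + m); m = 0 gives the
    plain fraction, m > 0 gives the m-estimate."""
    in_class = [r for r in rows if r[-1] == cls]
    n_c = sum(1 for r in in_class if r[attr_idx] == value)
    return (n_c + m * p) / (len(in_class) + m)

# (a) plain fractions
print(cond_prob(0, 1, '+'), cond_prob(1, 1, '+'), cond_prob(2, 0, '-'))
# -> 0.6 0.2 0.0
# (c) m-estimate with p = 1/2, m = 4: P(A=0|+) = (2 + 2)/(5 + 4) = 4/9
print(round(cond_prob(0, 0, '+', m=4), 4))  # 0.4444
```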

Answer (b): Let R = (A=0, B=1, C=0) and K = P(A=0, B=1, C=0).
P(+|R) = P(A=0|+) P(B=1|+) P(C=0|+) P(+) / K = (2/5)(1/5)(1/5)(0.5) / K = 0.008 / K
P(-|R) = P(A=0|-) P(B=1|-) P(C=0|-) P(-) / K = (3/5)(2/5)(0)(0.5) / K = 0
Since P(+|R) > P(-|R), the record is classified as +.

Answer (c): With the m-estimate P = (n_c + m p)/(n + m), p = 1/2 and m = 4:
P(A=0|+) = (2+2)/9 = 4/9, P(B=1|+) = (1+2)/9 = 3/9, P(C=0|+) = (1+2)/9 = 3/9;
P(A=0|-) = (3+2)/9 = 5/9, P(B=1|-) = (2+2)/9 = 4/9, P(C=0|-) = (0+2)/9 = 2/9.

Answer (d): With these estimates,
P(+|R) = (4/9) × (3/9) × (3/9) × 0.5 / K = 0.0247 / K
P(-|R) = (5/9) × (4/9) × (2/9) × 0.5 / K = 0.0274 / K
so the record is now classified as -.

Answer (e): The m-estimate is the better method here: with the plain fractions, the zero conditional probability P(C=0|-) = 0 forces the entire posterior for the - class to zero, so a single unseen attribute value dominates the prediction; the m-estimate smooths the counts and avoids this.

Are A and B conditionally independent?
1. P(A=1|+) = __, P(B=1|+) = __, P(C=1|+) = __, P(A=1|-) = __, P(B=1|-) = __, and P(C=1|-) = __.
2. Let R: (A=1, B=1, C=1) be the test record. To determine its class, we need to compute P(+|R) and P(-|R) using Bayes' theorem: P(+|R) = P(R|+)P(+)/P(R) and P(-|R) = P(R|-)P(-)/P(R). Since P(+) = P(-) = 0.5 and P(R) is constant, R can be classified by comparing P(R|+) and P(R|-). For this question,
P(R|+) = P(A=1|+) × P(B=1|+) × P(C=1|+) = __
P(R|-) = P(A=1|-) × P(B=1|-) × P(C=1|-) = __
Since P(R|+) is larger, the record is assigned to the + class.
3. P(A=1) = __, P(B=1) = __, and P(A=1, B=1) = P(A=1) × P(B=1). Therefore, A and B are independent.
4. P(A=1) = __, P(B=0) = __, and P(A=1, B=0) = P(A=1) × P(B=0). A and B are still independent.
5. Compare P(A=1, B=1|+) = __ against P(A=1|+) = __ and P(B=1|Class=+) = __. Since the product of P(A=1|+) and P(B=1|+) is not the same as P(A=1, B=1|+), A and B are not conditionally independent given the class.

III. Use the similarity matrix in the table below to perform single-link and complete-link hierarchical clustering. Draw dendrograms showing the results; the dendrograms should clearly show the order in which the clusters are merged.

Table 8.1. Similarity matrix for Exercise 16.

2. Consider the data set shown in Table 6.22 (example market-basket transactions):

Customer ID  Transaction ID  Items bought
1            0001            {a, d, e}
1            0024            {a, b, c, e}
2            0012            {a, b, d, e}
2            0031            {a, c, d, e}
3            0015            {b, c, e}
3            0022            {b, d, e}
4            0029            {c, d}
4            0040            {a, b, c}
5            0033            {a, d, e}
5            0038            {a, b, e}

(a) Treating each transaction ID as a basket, compute the support of the itemsets {e}, {b, d}, and {b, d, e}.
(b) Using the results of (a), compute the confidence of the association rules {b, d} → {e} and {e} → {b, d}. Is confidence a symmetric measure?
(c) Repeat (a), treating each customer ID as a basket. Each item should be treated as a binary variable (1 if the item appears in at least one of the customer's transactions, 0 otherwise).
(d) Using the results of (c), compute the confidence of {b, d} → {e} and {e} → {b, d}.
(e) Suppose s1 and c1 are the support and confidence of a rule r when each transaction ID is treated as a basket, while s2 and c2 are its support and confidence when each customer ID is treated as a basket. Discuss whether there is any relationship between s1 and s2, or between c1 and c2.

Answer (a): s({e}) = 8/10 = 0.8, s({b, d}) = 2/10 = 0.2, s({b, d, e}) = 2/10 = 0.2.
Answer (b): c({b, d} → {e}) = s({b, d, e})/s({b, d}) = 0.2/0.2 = 1.0, while c({e} → {b, d}) = s({b, d, e})/s({e}) = 0.2/0.8 = 0.25. Confidence is not a symmetric measure.
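The supports and confidences for both basket definitions can be reproduced with a sketch (Table 6.22 is hard-coded; `support` is an illustrative helper computing the fraction of baskets containing an itemset):

```python
# Table 6.22: (customer_id, transaction_id, items)
baskets = [
    (1, '0001', {'a','d','e'}),     (1, '0024', {'a','b','c','e'}),
    (2, '0012', {'a','b','d','e'}), (2, '0031', {'a','c','d','e'}),
    (3, '0015', {'b','c','e'}),     (3, '0022', {'b','d','e'}),
    (4, '0029', {'c','d'}),         (4, '0040', {'a','b','c'}),
    (5, '0033', {'a','d','e'}),     (5, '0038', {'a','b','e'}),
]

def support(itemset, transactions):
    s = set(itemset)
    return sum(s <= t for t in transactions) / len(transactions)

# (a) each transaction ID is one basket
tx = [items for _, _, items in baskets]
print(support('e', tx), support('bd', tx), support('bde', tx))  # 0.8 0.2 0.2
# (b) confidence: c(X -> Y) = s(X u Y) / s(X)
print(support('bde', tx) / support('bd', tx))   # 1.0
print(support('bde', tx) / support('e', tx))    # 0.25
# (c) each customer ID is one basket (union of that customer's transactions)
cust = {}
for cid, _, items in baskets:
    cust.setdefault(cid, set()).update(items)
cu = list(cust.values())
print(support('e', cu), support('bd', cu), support('bde', cu))  # 0.8 1.0 0.8
```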

Answer (c): s({e}) = 4/5 = 0.8, s({b, d}) = 5/5 = 1.0, s({b, d, e}) = 4/5 = 0.8.
Answer (d): c({b, d} → {e}) = 0.8/1.0 = 0.8, while c({e} → {b, d}) = 0.8/0.8 = 1.0.
Answer (e): There are no apparent relationships between s1 and s2, or between c1 and c2.

6. Consider the market-basket transactions shown in Table 6.23:

Transaction ID  Items bought
1               {Milk, Beer, Diapers}
2               {Bread, Butter, Milk}
3               {Milk, Diapers, Cookies}
4               {Bread, Butter, Cookies}
5               {Beer, Cookies, Diapers}
6               {Milk, Diapers, Bread, Butter}
7               {Bread, Butter, Diapers}
8               {Beer, Diapers}
9               {Milk, Diapers, Bread, Butter}
10              {Beer, Cookies}

(a) What is the maximum number of association rules that can be extracted from this data (including rules that have zero support)?
Answer: There are six items in the data set, so the total number of rules is 3^6 - 2^7 + 1 = 602.
(b) What is the maximum size of frequent itemsets that can be extracted (assuming minsup > 0)?
Answer: Because the longest transaction contains 4 items, the maximum size of a frequent itemset is 4.
(c) Write an expression for the maximum number of size-3 itemsets that can be derived from this data set.
Answer: C(6, 3) = 20.
(d) Find an itemset (of size 2 or larger) that has the largest support.
Answer: {Bread, Butter}.
(e) Find a pair of items, a and b, such that the rules a → b and b → a have the same confidence.
Answer: (Beer, Cookies) or (Bread, Butter).

8. The Apriori algorithm uses a generate-and-count strategy to find frequent itemsets. A candidate itemset of size k+1 is created by merging a pair of frequent itemsets of size k (the candidate generation step). In the candidate pruning step, a candidate itemset is discarded if any of its subsets is infrequent.
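The counting answers for Table 6.23 can be verified with a short script (a sketch, with the transactions hard-coded from the table above; `support` here counts transactions containing an itemset):

```python
from itertools import combinations

# Table 6.23 market-basket transactions
T = [
    {'Milk', 'Beer', 'Diapers'},   {'Bread', 'Butter', 'Milk'},
    {'Milk', 'Diapers', 'Cookies'}, {'Bread', 'Butter', 'Cookies'},
    {'Beer', 'Cookies', 'Diapers'}, {'Milk', 'Diapers', 'Bread', 'Butter'},
    {'Bread', 'Butter', 'Diapers'}, {'Beer', 'Diapers'},
    {'Milk', 'Diapers', 'Bread', 'Butter'}, {'Beer', 'Cookies'},
]
items = sorted(set().union(*T))
d = len(items)

# (a) total number of rules, including zero-support ones: 3^d - 2^(d+1) + 1
print(3 ** d - 2 ** (d + 1) + 1)          # 602
# (b) the longest transaction bounds the largest frequent itemset
print(max(len(t) for t in T))             # 4
# (c) number of size-3 itemsets: C(6, 3)
print(len(list(combinations(items, 3))))  # 20

def support(s):
    """Number of transactions containing every item of s."""
    return sum(set(s) <= t for t in T)

# (d) the size-2 itemset with the largest support
best = max(combinations(items, 2), key=support)
print(best, support(best))                # ('Bread', 'Butter') 5
```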
