Data Mining Report
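The core of the ID3 algorithm is this: when choosing an attribute at each level of the decision tree, it computes the information gain (InformationGain) of every candidate attribute, so that the test at each non-leaf node yields the most class information about the records being tested.
The procedure is as follows: examine all attributes, select the one with the largest information gain to create a decision-tree node, and create one branch per distinct value of that attribute. The same method is then applied recursively to the subset of records in each branch, until every subset contains records of a single class only.
The result is a decision tree that can be used to classify new samples.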
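As an illustration of the selection step described above, the following Python sketch computes information gain for categorical attributes and picks the best one. The function names and the four toy records are assumptions made for the example; they are not taken from the report's own program.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


def best_attribute(rows, attrs, target):
    """Attribute with the largest information gain (the ID3 choice at one node)."""
    return max(attrs, key=lambda a: information_gain(rows, a, target))


# Toy records, invented for illustration only.
rows = [
    {"sexType": "Male", "ageType": "year31_40", "makeover50k": "yes"},
    {"sexType": "Male", "ageType": "year21_30", "makeover50k": "yes"},
    {"sexType": "Female", "ageType": "year31_40", "makeover50k": "no"},
    {"sexType": "Female", "ageType": "year21_30", "makeover50k": "no"},
]
chosen = best_attribute(rows, ["sexType", "ageType"], "makeover50k")
```

In this toy data sexType separates the two classes perfectly while ageType tells nothing, so sexType is the attribute chosen for the node. A full ID3 implementation would repeat this choice recursively on each branch.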
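(2) The C4.5 algorithm inherits the strengths of ID3 and improves on it in the following respects:
1) It selects attributes by information gain ratio (GainRate), which overcomes the bias of plain information gain toward attributes with many distinct values;
2) It prunes the tree during construction;
3) It can discretize continuous attributes;
4) It can handle incomplete data.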
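The first improvement can be made concrete with a small sketch. Gain ratio divides the information gain by the entropy of the attribute's own value distribution (the split information), which penalizes attributes with many distinct values. The code below is a minimal illustration under that standard definition; the function names and the toy records, including the artificial `id` attribute, are assumptions for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, attr, target):
    """Information gain of attr divided by its split information."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([r[attr] for r in rows])
    if split_info == 0:  # attribute has a single value, so it cannot split anything
        return 0.0
    return (base - remainder) / split_info


# Toy records, invented for illustration; "id" is unique per record.
rows = [
    {"id": 1, "sexType": "Male", "makeover50k": "yes"},
    {"id": 2, "sexType": "Male", "makeover50k": "yes"},
    {"id": 3, "sexType": "Female", "makeover50k": "no"},
    {"id": 4, "sexType": "Female", "makeover50k": "no"},
]
ratio_id = gain_ratio(rows, "id", "makeover50k")
ratio_sex = gain_ratio(rows, "sexType", "makeover50k")
```

Both attributes have the same information gain here (1 bit), but the four-valued `id` has a split information of 2 bits against 1 bit for sexType, so its gain ratio is half as large and sexType is preferred.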
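(3) Gini measure:
Decision trees commonly use an information measure to assess the quality of a node split. Some algorithms use the gini index instead, which the authors consider to perform better and which is simpler to compute, since no logarithm is needed. For a dataset S containing n classes, gini(S) is defined as:
$gini(S) = 1 - \sum_{j=1}^{n} p_j^2$, where $p_j$ is the relative frequency of class j in S.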
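The definition above translates directly into code. The sketch below computes gini(S) from a list of class labels and, as is commonly done when gini is used as a split criterion, the size-weighted gini of a partition; the function names and the weighted form are assumptions for illustration rather than the report's implementation.

```python
from collections import Counter


def gini(labels):
    """gini(S) = 1 - sum of squared class frequencies."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(groups):
    """Size-weighted gini of the subsets produced by a candidate split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


pure = gini(["yes", "yes", "yes"])
mixed = gini(["yes", "no"])
after_split = gini_split([["yes", "yes"], ["no", "no"]])
```

A pure set has gini 0, an evenly mixed two-class set has gini 0.5, and a split that separates the classes completely brings the weighted gini back to 0.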
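2. Naive Bayes algorithm
Naive Bayes is based on Bayes' theorem and assumes that the predictor attributes are conditionally independent of one another given the target attribute.
First the probability of each class is estimated, then the probability of the data tuple X within each class. Because the attributes are assumed to be independent, the probability of X under each class is the product of the per-attribute conditional probabilities; the class with the largest resulting probability is the one assigned to X.
The algorithm requires the probability of every combination of a target value and a predictor attribute value.
To limit the number of such combinations, attributes that are continuous or have many distinct values are usually binned.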
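The procedure just described can be sketched for categorical attributes as follows. This is a minimal illustration that estimates probabilities by plain counting, without any smoothing; the function names and the five toy records are assumptions for the example and do not reproduce the report's program.

```python
from collections import Counter, defaultdict


def train(rows, attrs, target):
    """Count class frequencies and, per (attribute, class), value frequencies."""
    class_counts = Counter(r[target] for r in rows)
    cond_counts = defaultdict(Counter)
    for r in rows:
        for a in attrs:
            cond_counts[(a, r[target])][r[a]] += 1
    return class_counts, cond_counts


def predict(model, attrs, x):
    """Return the class maximizing P(class) * product of P(value | class)."""
    class_counts, cond_counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n in class_counts.items():
        score = n / total  # prior P(class)
        for a in attrs:
            score *= cond_counts[(a, c)][x[a]] / n  # P(attribute value | class)
        scores[c] = score
    return max(scores, key=scores.get), scores


# Toy records, invented for illustration only.
rows = [
    {"sexType": "Male", "educationType": "edu13_14", "makeover50k": "yes"},
    {"sexType": "Male", "educationType": "edu13_14", "makeover50k": "yes"},
    {"sexType": "Female", "educationType": "edu7_9", "makeover50k": "no"},
    {"sexType": "Male", "educationType": "edu7_9", "makeover50k": "no"},
    {"sexType": "Female", "educationType": "edu13_14", "makeover50k": "no"},
]
attrs = ["sexType", "educationType"]
model = train(rows, attrs, "makeover50k")
label, scores = predict(model, attrs, {"sexType": "Male", "educationType": "edu13_14"})
```

For the query tuple the score for yes is 2/5 x 1 x 1 = 0.4 and the score for no is 3/5 x 1/3 x 1/3, so yes is predicted. Without smoothing, a value never seen with a class gives that class a score of zero, which is one reason binning into a small number of categories helps.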
This report compares the above algorithms experimentally and checks the accuracy of the classifiers built with each of them.
II. Dataset description and problem analysis
1. The training set adult.data.txt and the test set adult.test.txt can be downloaded from:
http://archive.ics.uci.edu/ml/machine-learning-databases/adult/
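2. Data cleaning:
(1) The original dataset has 14 attributes:
age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week, native_country. These attributes are used to predict whether a person earns more than 50K per year.
(2) Because the original attributes are too numerous, only 5 are kept: age, education (equivalent in meaning to education_num), occupation, sex and native_country, which are the ones most relevant to makeover50k.
They are categorized as follows: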
●ageType was originally continuous and is divided into 6 classes: <=20 is year0_20, 21-30 is year21_30, 31-40 is year31_40, 41-50 is year41_50, 51-60 is year51_60, and >=61 is yearover60.
●education_numType originally has levels 1-16 and is divided into 6 levels: edu1_3, edu4_6, edu7_9, edu10_12, edu13_14, edu15_16
●occupationType has 14 classes, kept unchanged: Tech_support, Craft_repair, Other_service(?), Sales, Exec_managerial, Prof_specialty, Handlers_cleaners, Machine_op_inspct, Adm_clerical, Farming_fishing, Transport_moving, Priv_house_serv, Protective_serv, Armed_Forces
●sexType has two classes, Male and Female
●native-countryType originally takes the following values:
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands
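The age and education binning described in the list above can be sketched in Python as follows. The bin labels come from the report; the function names, the use of `bisect`, and the reading of the education boundaries from the label names (for example edu13_14 covering levels 13 and 14) are assumptions for illustration.

```python
import bisect

# Inclusive upper bound of every bin except the last, open-ended one.
AGE_UPPER = [20, 30, 40, 50, 60]
AGE_LABELS = ["year0_20", "year21_30", "year31_40",
              "year41_50", "year51_60", "yearover60"]

# Inclusive upper bound of each education_num bin (levels 1-16).
EDU_UPPER = [3, 6, 9, 12, 14, 16]
EDU_LABELS = ["edu1_3", "edu4_6", "edu7_9",
              "edu10_12", "edu13_14", "edu15_16"]


def age_type(age):
    """Map an integer age to its ageType label."""
    return AGE_LABELS[bisect.bisect_left(AGE_UPPER, age)]


def education_type(education_num):
    """Map an education_num level (1-16) to its educationType label."""
    if not 1 <= education_num <= 16:
        raise ValueError("education_num must be between 1 and 16")
    return EDU_LABELS[bisect.bisect_left(EDU_UPPER, education_num)]


examples = [age_type(20), age_type(21), age_type(61), education_type(13)]
```

`bisect_left` returns the index of the first upper bound that is not smaller than the value, so a value equal to a bound (such as age 20) stays in the lower bin.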
native_country is divided into 5 classes, developNO1, developNO2, developNO3, developNO4 and developNO5, as follows:
developNO1 (<0.1): Outlying-US(Guam-USVI-etc), Vietnam, Mexico, Dominican-Republic, Laos, Haiti, Hungary, Guatemala, Nicaragua, Scotland, El-Salvador, Trinadad&Tobago, Holand-Netherlands
developNO2 (>=0.1, <0.2): Puerto-Rico, South, China, Cuba, Poland, Jamaica, Portugal, Ireland, Ecuador, Peru, ?
developNO3 (>=0.2, <0.3): Honduras, France, Columbia, United-States, England, Germany, Greece, Philippines, Thailand, Yugoslavia
developNO4 (>=0.3, <0.4): India, Japan
developNO5 (>=0.4): Cambodia, Canada, Iran, Italy, Taiwan, Hong
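Grouping method (shown here for native_country = '?'):
SELECT COUNT(*)
FROM traindata1
WHERE native_country = '?'

SELECT COUNT(*)
FROM traindata1
WHERE native_country = '?'
AND makeover = '>50K'
The second count is then divided by the first, and the country is assigned to a group according to the resulting ratio.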
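The same calculation can be sketched in Python: count the records for a country, count those with income above 50K, divide, and map the ratio to a group using the thresholds 0.1, 0.2, 0.3 and 0.4 stated for the five groups. The function names, the in-memory record format and the nine toy records are assumptions for illustration; returning 0.0 for a country with no records is also an assumption.

```python
def share_over_50k(rows, country):
    """Fraction of a country's records whose makeover value is '>50K'."""
    total = sum(1 for r in rows if r["native_country"] == country)
    over = sum(1 for r in rows
               if r["native_country"] == country and r["makeover"] == ">50K")
    return over / total if total else 0.0  # assumption: no records means ratio 0


def develop_group(ratio):
    """Map a ratio to developNO1..developNO5 (each lower bound is inclusive)."""
    for upper, label in [(0.1, "developNO1"), (0.2, "developNO2"),
                         (0.3, "developNO3"), (0.4, "developNO4")]:
        if ratio < upper:
            return label
    return "developNO5"


# Toy records, invented for illustration: 9 records for '?', one above 50K.
rows = [{"native_country": "?", "makeover": "<=50K"} for _ in range(8)]
rows.append({"native_country": "?", "makeover": ">50K"})

ratio = share_over_50k(rows, "?")
group = develop_group(ratio)
```

With one record in nine above 50K the ratio is about 0.111, which falls in developNO2. A ratio exactly on a boundary, such as 0.2, goes to the higher group, matching the >= in the group definitions.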
Computed values:
Outlying-US(Guam-USVI-etc): 0, Vietnam: 0.08, Mexico: 0.064, Dominican-Republic: 0.083, Laos: 0, Haiti: 0.067, Hungary: (value missing), Guatemala: 0.048, Nicaragua: 0, Scotland: 0, El-Salvador: 0.097, Trinadad&Tobago: 0, Holand-Netherlands: (value missing)
Puerto-Rico: 0.1395, South: 0.1, China: 0.1613, Cuba: 0.1516, Poland: 0.1739, Jamaica: 0.1875, Portugal: 0.1667, Ireland: 0.125, Ecuador: 0.125, Peru: 0.1111, ?: 0.1111
Honduras: 0.2, France: 0.2, Columbia: 0.2, United-States: 0.2418, England: 0.2812, Germany: 0.2653, Greece: 0.2222, Philippines: 0.2632, Thailand: 0.25, Yugoslavia: 0.25
India: 0.3, Japan: 0.3
Cambodia: 0.4285, Canada: 0.404, Iran: 0.4783, Italy: 0.4545, Taiwan: 0.4211, Hong: 0.5
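III. Experimental results
1. Viewing the database:
All original and cleaned data are stored in adult_data.mdf, in 4 parts: the original training set, the cleaned training set, the original test set and the cleaned test set.
Note: the original datasets are for viewing only and cannot be used for computation or testing.
The cleaned training set can be used to build the decision tree and compute the probabilities, and can also be used for testing; the cleaned test set can be used for testing only.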
2. Decision tree algorithm:
There are three ways to build the tree, depending on the attribute-selection measure used.
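(1) Import the cleaned training set and select InformationGain to build the decision tree
The training set yields roughly 160 rules with makeover50k=yes.
Clicking "Decision tree test" gives the following result on the training set:
Total test records: 10736
trueYes records: 247
trueNo records: 8032
falseYes records: 154
falseNo records: 2303
Accuracy: 77.1143815201192%
Importing the cleaned test set and clicking "Decision tree test" gives the following result on the test set:
Total test records: 16281
trueYes records: 319
trueNo records: 12123
falseYes records: 312
falseNo records: 3527
Accuracy: 76.4203672993059%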
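The reported accuracy is the share of correctly classified records, (trueYes + trueNo) divided by the total. The short Python check below recomputes it from the four counts listed above for the InformationGain tree; the function name is an assumption for illustration.

```python
def accuracy(true_yes, true_no, false_yes, false_no):
    """Share of records whose predicted class matches the actual class."""
    total = true_yes + true_no + false_yes + false_no
    return (true_yes + true_no) / total


# Counts reported for the InformationGain tree.
train_acc = accuracy(247, 8032, 154, 2303)    # training set, 10736 records
test_acc = accuracy(319, 12123, 312, 3527)    # test set, 16281 records

print(f"training set: {train_acc:.4%}")
print(f"test set:     {test_acc:.4%}")
```

The four counts also add up to the stated totals, which is a quick way to confirm that a result block is complete.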
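(2) Import the cleaned training set and select GainRate to build the decision tree
The training set yields roughly 50 rules with makeover50k=yes.
Clicking "Decision tree test" gives the following result on the training set:
Total test records: 10736
trueYes records: 17
trueNo records: 8182
falseYes records: 4
falseNo records: 2533
Accuracy: 76.3692250372578%
Importing the cleaned test set and clicking "Decision tree test" gives the following result on the test set:
Total test records: 16281
trueYes records: 6
trueNo records: 12407
falseYes records: 28
falseNo records: 3840
Accuracy: 76.2422455623119%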
(3) Import the cleaned training set and select Gini to build the decision tree
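The training set yields 6 rules with makeover50k=yes.
Clicking "Decision tree test" gives the following result on the training set:
Total test records: 10736
trueYes records: 1014
trueNo records: 7554
falseYes records: 632
falseNo records: 1536
Accuracy: 79.806259314456%
Importing the cleaned test set and clicking "Decision tree test" gives the following result on the test set:
Total test records: 16281
trueYes records: 1498
trueNo records: 11516
falseYes records: 919
falseNo records: 2348
Accuracy: 79.9336650082919%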
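3. Naive Bayes algorithm
Import the cleaned training set and select "Compute attribute probabilities for the training set"
Probabilities of makeover50k in the training set:
P(makeover50k=yes): 0.237518628912072
P(makeover50k=no): 0.762481371087928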
Probabilities of ageType given makeover50k=yes:
P(ageType=year0_20|makeover50k=yes): 0.000392156862745098
P(ageType=year21_30|makeover50k=yes): 0.0803921568627451
P(ageType=year31_40|makeover50k=yes): 0.65921568627451
P(ageType=year41_50|makeover50k=yes): 0.190196078431373
P(ageType=year51_60|makeover50k=yes): (value missing)
P(ageType=yearover60|makeover50k=yes): 0.0698039215686275
Probabilities of educationType given makeover50k=yes:
P(educationType=edu1_3|makeover50k=yes): 0.00431372549019608
P(educationType=edu4_6|makeover50k=yes): 0.0196078431372549
P(educationType=edu7_9|makeover50k=yes): 0.234901960784314
P(educationType=edu10_12|makeover50k=yes): 0.246666666666667
P(educationType=edu13_14|makeover50k=yes): 0.403529411764706
P(educationType=edu15_16|makeover50k=yes): .0909********
Probabilities of occupationType given makeover50k=yes:
P(occupationType=Tech_support|makeover50k=yes): 0.0380392156862745
P(occupationType=Craft_repair|makeover50k=yes): 0.116470588235294
P(occupationType=Other_service|makeover50k=yes): .0439********
P(occupationType=Sales|makeover50k=yes): 0.132********0588
P(occupationType=Exec_managerial|makeover50k=yes): 0.247450980392157
P(occupationType=Prof_specialty|makeover50k=yes): 0.232156862745098
P(occupationType=Handlers_cleaners|makeover50k=yes): 0.012156862745098
P(occupationType=Machine_op_inspct|makeover50k=yes): 0.0341176470588235
P(occupationType=Adm_clerical|makeover50k=yes): .0596********
P(occupationType=Farming_fishing|makeover50k=yes): 0.0129411764705882
P(occupationType=Transport_moving|makeover50k=yes): (value missing)
P(occupationType=Priv_house_serv|makeover50k=yes): (value missing)
P(occupationType=Protective_serv|makeover50k=yes): 0.0262745098039216
P(occupationType=Armed_Forces|makeover50k=yes): (value missing)
Probabilities of sexType given makeover50k=yes:
P(sexType=Male|makeover50k=yes): 0.844705882352941
P(sexType=Female|makeover50k=yes): 0.155********7059
Probabilities of ageType given makeover50k=no:
P(ageType=year0_20|makeover50k=no): .0999********
P(ageType=year21_30|makeover50k=no): 0.304544344001955
P(ageType=year31_40|makeover50k=no): 0.422184216955778
P(ageType=year41_50|makeover50k=no): 0.104202296603958
P(ageType=year51_60|makeover50k=no): (value missing)
P(ageType=yearover60|makeover50k=no): .0691********
Probabilities of educationType given makeover50k=no:
P(educationType=edu1_3|makeover50k=no): 0.0190569264598094
P(educationType=edu4_6|makeover50k=no): 0.0786709015392133
P(educationType=edu7_9|makeover50k=no): 0.412777913510872
P(educationType=edu10_12|makeover50k=no): 0.320180796481798
P(educationType=edu13_14|makeover50k=no): 0.159********3406
P(educationType=edu15_16|makeover50k=no): 0.00989494258490105
Probabilities of occupationType given makeover50k=no:
P(occupationType=Tech_support|makeover50k=no): .024*********
P(occupationType=Craft_repair|makeover50k=no): 0.121915465428781
P(occupationType=Other_service|makeover50k=no): 0.197654532128023
P(occupationType=Sales|makeover50k=no): 0.111531883703885
P(occupationType=Exec_managerial|makeover50k=no): .0814********
P(occupationType=Prof_specialty|makeover50k=no): .0935********
P(occupationType=Handlers_cleaners|makeover50k=no): 0.0487417542145126
P(occupationType=Machine_op_inspct|makeover50k=no): 0.0708526752992915
P(occupationType=Adm_clerical|makeover50k=no): 0.137********8629
P(occupationType=Farming_fishing|makeover50k=no): .0350********
P(occupationType=Transport_moving|makeover50k=no): 0.0522843879794772
P(occupationType=Priv_house_serv|makeover50k=no): 0.00598582946494014
P(occupationType=Protective_serv|makeover50k=no): 0.0186904471048131
P(occupationType=Armed_Forces|makeover50k=no): 0.000244319569997557
Probabilities of sexType given makeover50k=no:
P(sexType=Male