Data Mining Report

Uploaded by: b****6 | Document ID: 21249435 | Upload date: 2023-01-28 | Format: DOCX | Pages: 17 | Size: 826.96KB

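The core of the ID3 algorithm is this: when choosing an attribute at each level of the decision tree, it computes the information gain (InformationGain) of every candidate attribute and selects on that basis, so that each test at a non-leaf node yields the most class information about the record being tested.

The procedure is as follows: examine all attributes, select the one with the largest information gain to create a decision tree node, create one branch for each distinct value of that attribute, and then apply the same method recursively to the subset on each branch, until every subset contains records of a single class.

The result is a decision tree that can be used to classify new samples.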
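The attribute selection step described above can be sketched in a few lines of Python. This is a minimal illustration, not the report's implementation: the function names and the four toy records are assumptions made up for the example, and only the column names are borrowed from the cleaned data set.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([row[target] for row in rows])
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(part) / len(rows) * entropy(part) for part in partitions.values())
    return base - remainder


def best_attribute(rows, attributes, target):
    """ID3 selection step: pick the attribute with the largest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Toy records, invented for illustration (not taken from the adult data set).
rows = [
    {"sexType": "Male", "ageType": "year31_40", "makeover50k": "yes"},
    {"sexType": "Male", "ageType": "year21_30", "makeover50k": "no"},
    {"sexType": "Female", "ageType": "year31_40", "makeover50k": "yes"},
    {"sexType": "Female", "ageType": "year21_30", "makeover50k": "no"},
]
chosen = best_attribute(rows, ["sexType", "ageType"], "makeover50k")
print(chosen)
```

In this toy data ageType separates the classes perfectly while sexType carries no class information, so ageType is selected. A full ID3 implementation would call the selection step recursively on each partition.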

 

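(2) The C4.5 algorithm inherits the strengths of ID3 and improves on it in the following respects:

1) It selects attributes by information gain ratio (GainRate), which overcomes the bias of information gain toward attributes with many distinct values;

2) It prunes the tree during construction;

3) It can discretize continuous attributes;

4) It can handle incomplete data.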
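Improvement 1) can be illustrated with a short sketch. It assumes the standard definition of gain ratio (information gain divided by the entropy of the attribute's own value distribution); the function names and toy values are hypothetical and not taken from the report.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(attribute_values, labels):
    """Information gain divided by split information (entropy of the attribute itself)."""
    total = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Toy example, invented for illustration.
labels = ["yes", "yes", "no", "no"]
record_id = ["a", "b", "c", "d"]   # one distinct value per record
two_valued = ["x", "x", "y", "y"]  # two values, also a perfect split

print(gain_ratio(record_id, labels))
print(gain_ratio(two_valued, labels))
```

Both attributes have the same information gain (1 bit), but the many-valued one is penalized by its larger split information, so the two-valued attribute gets the higher gain ratio.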

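(3) Gini measure:

Decision trees generally use an information measure to evaluate the quality of a node split. Some algorithms use the Gini index instead, which this report regards as performing better and being easier to compute. For a data set S containing n classes, gini(S) is defined as:

$\mathrm{gini}(S) = 1 - \sum_{j=1}^{n} p_j^{2}$, where $p_j$ is the relative frequency of class j in S.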
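As a small sketch of the definition above, the following Python computes gini(S) for a list of class labels, plus the size-weighted Gini of a split. The weighted-split helper is an assumption based on common practice (as in CART) and is not described in the report.

```python
from collections import Counter


def gini(labels):
    """gini(S) = 1 - sum of squared class frequencies."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(groups):
    """Size-weighted average Gini of the subsets produced by a split (assumed convention)."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups)


mixed = gini(["yes", "no", "yes", "no"])
pure = gini(["no", "no", "no"])
split = gini_split([["yes", "yes"], ["no", "no"]])
print(mixed, pure, split)
```

A pure subset has a Gini value of 0 and an evenly mixed two-class subset has 0.5, so a lower weighted value after a split indicates a better split.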

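2. Naive Bayes algorithm

Naive Bayes is based on Bayes' theorem and assumes that the predictor attributes are conditionally independent of one another given the target attribute.

The classifier first estimates the probability of each class, and then the probability of each attribute value of the data tuple X within each class. Because the attributes are assumed independent, these per-attribute probabilities can be multiplied to give the probability of X under each class, and the class with the largest probability is assigned to X.

The naive Bayes algorithm involves computing a probability for every pairing of a target value with a predictor attribute value.

To keep the number of such pairings manageable, attributes with continuous values or with many distinct values are usually binned.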
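The classification rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: probabilities are plain relative frequencies with no smoothing, the function names are hypothetical, and the five training records are invented for the example.

```python
from collections import Counter, defaultdict


def train(rows, target):
    """Count class frequencies and, per (attribute, class), the attribute value frequencies."""
    class_counts = Counter(row[target] for row in rows)
    value_counts = defaultdict(Counter)
    for row in rows:
        for attribute, value in row.items():
            if attribute != target:
                value_counts[(attribute, row[target])][value] += 1
    return class_counts, value_counts


def classify(record, class_counts, value_counts):
    """Return the class maximizing P(class) * product of P(value | class), and all scores."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attribute, value in record.items():
            # Unsmoothed estimate of P(attribute=value | class); unseen values give 0.
            score *= value_counts[(attribute, label)][value] / count
        scores[label] = score
    return max(scores, key=scores.get), scores


# Toy training records, invented for illustration.
rows = [
    {"sexType": "Male", "educationType": "edu13_14", "makeover50k": "yes"},
    {"sexType": "Male", "educationType": "edu7_9", "makeover50k": "no"},
    {"sexType": "Female", "educationType": "edu7_9", "makeover50k": "no"},
    {"sexType": "Female", "educationType": "edu13_14", "makeover50k": "yes"},
    {"sexType": "Male", "educationType": "edu13_14", "makeover50k": "no"},
]
class_counts, value_counts = train(rows, "makeover50k")
label, scores = classify({"sexType": "Male", "educationType": "edu13_14"},
                         class_counts, value_counts)
print(label, scores)
```

For the query record the score for yes is 2/5 × 1/2 × 2/2 = 0.2 and the score for no is 3/5 × 2/3 × 1/3 ≈ 0.133, so the record is classified as yes.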

This report compares the above algorithms experimentally and measures the accuracy of the classifier that each one produces.

II. Dataset description and problem analysis

1. The training set adult.data.txt and the test set adult.test.txt can be downloaded from:

http://archive.ics.uci.edu/ml/machine-learning-databases/adult/

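2. Data cleaning:

(1) The original data set has 14 attributes: age, workclass, fnlwgt, education, education_num, marital_status, occupation, relationship, race, sex, capital_gain, capital_loss, hours_per_week and native_country. These attributes are used to predict whether a person earns more than 50k per year.

(2) Because the original data set has too many attributes, only the 5 judged most relevant to makeover50k are kept: age, education (which carries the same meaning as education_num), occupation, sex and native_country.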

They are categorized as follows:

● ageType: originally continuous, divided into 6 classes: <=20 is year0_20, 21-30 is year21_30, 31-40 is year31_40, 41-50 is year41_50, 51-60 is year51_60, and >=61 is yearover60.

● education_numType: originally levels 1-16, grouped into 6 levels: edu1_3, edu4_6, edu7_9, edu10_12, edu13_14, edu15_16.

● occupationType: 14 classes, kept unchanged: Tech_support, Craft_repair, Other_service(?), Sales, Exec_managerial, Prof_specialty, Handlers_cleaners, Machine_op_inspct, Adm_clerical, Farming_fishing, Transport_moving, Priv_house_serv, Protective_serv, Armed_Forces.

● sexType: two classes, Male and Female.

● native-countryType, original values: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
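The age and education_num groupings listed above amount to a simple binning step. The sketch below shows one way to express it; the helper names are hypothetical, and only the bin boundaries and labels come from the report.

```python
# Upper bound (inclusive) and label for each bin, as listed in the report.
AGE_BINS = [(20, "year0_20"), (30, "year21_30"), (40, "year31_40"),
            (50, "year41_50"), (60, "year51_60")]
EDU_BINS = [(3, "edu1_3"), (6, "edu4_6"), (9, "edu7_9"),
            (12, "edu10_12"), (14, "edu13_14"), (16, "edu15_16")]


def to_bin(value, bins, overflow=None):
    """Return the label of the first bin whose upper bound is >= value."""
    for upper, name in bins:
        if value <= upper:
            return name
    if overflow is None:
        raise ValueError(f"value {value} is above the last bin")
    return overflow


def age_type(age):
    """Map a continuous age to one of the 6 ageType classes."""
    return to_bin(age, AGE_BINS, overflow="yearover60")


def education_type(level):
    """Map an education_num level (1-16) to one of the 6 education classes."""
    if not 1 <= level <= 16:
        raise ValueError(f"education_num {level} is outside 1-16")
    return to_bin(level, EDU_BINS)


print(age_type(37), education_type(13))
```

Keeping the boundaries in tables rather than in chained if statements makes it easy to check them against the list in the text.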

native_country is then divided into 5 classes, developNO1 to developNO5, according to the share of each country's records with makeover >50K:

developNO1 (< 0.1): Outlying-US(Guam-USVI-etc), Vietnam, Mexico, Dominican-Republic, Laos, Haiti, Hungary, Guatemala, Nicaragua, Scotland, El-Salvador, Trinadad&Tobago, Holand-Netherlands

developNO2 (>= 0.1 and < 0.2): Puerto-Rico, South, China, Cuba, Poland, Jamaica, Portugal, Ireland, Ecuador, Peru, ?

developNO3 (>= 0.2 and < 0.3): Honduras, France, Columbia, United-States, England, Germany, Greece, Philippines, Thailand, Yugoslavia

developNO4 (>= 0.3 and < 0.4): India, Japan

developNO5 (>= 0.4): Cambodia, Canada, Iran, Italy, Taiwan, Hong

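Method used for the division, shown here for the value '?':

SELECT COUNT(*) FROM traindata1 WHERE native_country = '?'

SELECT COUNT(*) FROM traindata1 WHERE native_country = '?' AND makeover = '>50K'

The second count is then divided by the first, and the country is assigned to a class according to the result.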
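The two counts and the division can be reproduced end to end with a small script. This sketch uses Python's built-in sqlite3 module as a stand-in for the report's database, which is an assumption; the table and column names follow the queries above, the thresholds follow the developNO1 to developNO5 definitions, and the ten rows are invented toy data.

```python
import sqlite3


def develop_level(rate):
    """Map a >50K rate to the class names used in the report."""
    if rate < 0.1:
        return "developNO1"
    if rate < 0.2:
        return "developNO2"
    if rate < 0.3:
        return "developNO3"
    if rate < 0.4:
        return "developNO4"
    return "developNO5"


def over50k_rate(conn, country):
    """Run the two COUNT(*) queries for one country and divide the results."""
    total = conn.execute(
        "SELECT COUNT(*) FROM traindata1 WHERE native_country = ?", (country,)
    ).fetchone()[0]
    over = conn.execute(
        "SELECT COUNT(*) FROM traindata1 WHERE native_country = ? AND makeover = '>50K'",
        (country,),
    ).fetchone()[0]
    return over / total if total else 0.0


# In-memory table with invented toy rows (3 of 10 records above 50K).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE traindata1 (native_country TEXT, makeover TEXT)")
conn.executemany(
    "INSERT INTO traindata1 VALUES (?, ?)",
    [("India", ">50K")] * 3 + [("India", "<=50K")] * 7,
)
rate = over50k_rate(conn, "India")
level = develop_level(rate)
print(rate, level)
```

Passing the country as a query parameter instead of editing the SQL string makes it straightforward to loop over all country values.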

Specific values:

Outlying-US(Guam-USVI-etc): 0, Vietnam: 0.08, Mexico: 0.064, Dominican-Republic: 0.083, Laos: 0, Haiti: 0.067, Hungary: (value not shown), Guatemala: 0.048, Nicaragua: 0, Scotland: 0, El-Salvador: 0.097, Trinadad&Tobago: 0, Holand-Netherlands: (value not shown)

Puerto-Rico: 0.1395, South: 0.1, China: 0.1613, Cuba: 0.1516, Poland: 0.1739, Jamaica: 0.1875, Portugal: 0.1667, Ireland: 0.125, Ecuador: 0.125, Peru: 0.1111, ?: 0.1111

Honduras: 0.2, France: 0.2, Columbia: 0.2, United-States: 0.2418, England: 0.2812, Germany: 0.2653, Greece: 0.2222, Philippines: 0.2632, Thailand: 0.25, Yugoslavia: 0.25

India: 0.3, Japan: 0.3

Cambodia: 0.4285, Canada: 0.404, Iran: 0.4783, Italy: 0.4545, Taiwan: 0.4211, Hong: 0.5

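III. Experimental results

1. Viewing the database:

All original and cleaned data are stored in adult_data.mdf, in 4 parts: the original training set, the cleaned training set, the original test set and the cleaned test set.

Note: the original data sets are for viewing only and cannot be used for computation or testing. The cleaned training set can be used to generate the decision tree and the probabilities, and can also be tested against. The cleaned test set can only be used for testing.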

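2. Decision tree algorithm:

Depending on the split measure chosen, there are three ways to generate the tree.

(1) Import the cleaned training set and select InformationGain to generate the decision tree.

The training set yields roughly 160 rules with makeover50k=yes.

Click "Decision tree test". The results on the training set are:

Total test records: 10736
trueYes records: 247
trueNo records: 8032
falseYes records: 154
falseNo records: 2303
Accuracy: 77.1143815201192%

Import the cleaned test set and click "Decision tree test". The results on the test set are:

Total test records: 16281
trueYes records: 319
trueNo records: 12123
falseYes records: 312
falseNo records: 3527
Accuracy: 76.4203672993059%

(2) Import the cleaned training set and select GainRate to generate the decision tree.

The training set yields roughly 50 rules with makeover50k=yes.

Results on the training set:

Total test records: 10736
trueYes records: 17
trueNo records: 8182
falseYes records: 4
falseNo records: 2533
Accuracy: 76.3692250372578%

Results on the test set:

Total test records: 16281
trueYes records: 6
trueNo records: 12407
falseYes records: 28
falseNo records: 3840
Accuracy: 76.2422455623119%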

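(3) Import the cleaned training set and select Gini to generate the decision tree.

The training set yields 6 rules with makeover50k=yes.

Results on the training set:

Total test records: 10736
trueYes records: 1014
trueNo records: 7554
falseYes records: 632
falseNo records: 1536
Accuracy: 79.806259314456%

Results on the test set:

Total test records: 16281
trueYes records: 1498
trueNo records: 11516
falseYes records: 919
falseNo records: 2348
Accuracy: 79.9336650082919%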
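The reported accuracies can be checked directly from the four counts of each run, assuming that accuracy means (trueYes + trueNo) divided by the total number of records. The sketch below uses the counts from the three runs above, keyed by the report's own numbering; the function and variable names are hypothetical.

```python
def accuracy(true_yes, true_no, false_yes, false_no):
    """Share of correctly classified records: (trueYes + trueNo) / total."""
    total = true_yes + true_no + false_yes + false_no
    return (true_yes + true_no) / total


# (trueYes, trueNo, falseYes, falseNo) as reported for each run and data set.
results = {
    ("(1)", "training"): (247, 8032, 154, 2303),
    ("(1)", "test"): (319, 12123, 312, 3527),
    ("(2)", "training"): (17, 8182, 4, 2533),
    ("(2)", "test"): (6, 12407, 28, 3840),
    ("(3)", "training"): (1014, 7554, 632, 1536),
    ("(3)", "test"): (1498, 11516, 919, 2348),
}

for (run, dataset), counts in results.items():
    print(run, dataset, sum(counts), f"{100 * accuracy(*counts):.4f}%")
```

Each set of counts sums to 10736 for the training set and 16281 for the test set, and the recomputed percentages agree with the figures in the report.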

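3. Naive Bayes algorithm

Import the cleaned training set and select "Compute attribute probabilities for the training set".

Probabilities of makeover50k in the training set:

P(makeover50k=yes): 0.237518628912072
P(makeover50k=no): 0.762481371087928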

Probabilities of ageType given makeover50k=yes:

P(ageType=year0_20|makeover50k=yes): 0.000392156862745098
P(ageType=year21_30|makeover50k=yes): 0.0803921568627451
P(ageType=year31_40|makeover50k=yes): 0.65921568627451
P(ageType=year41_50|makeover50k=yes): 0.190196078431373
P(ageType=year51_60|makeover50k=yes):
P(ageType=yearover60|makeover50k=yes): 0.0698039215686275

Probabilities of educationType given makeover50k=yes:

P(educationType=edu1_3|makeover50k=yes): 0.00431372549019608
P(educationType=edu4_6|makeover50k=yes): 0.0196078431372549
P(educationType=edu7_9|makeover50k=yes): 0.234901960784314
P(educationType=edu10_12|makeover50k=yes): 0.246666666666667
P(educationType=edu13_14|makeover50k=yes): 0.403529411764706
P(educationType=edu15_16|makeover50k=yes): .0909********

Probabilities of occupationType given makeover50k=yes:

P(occupationType=Tech_support|makeover50k=yes): 0.0380392156862745
P(occupationType=Craft_repair|makeover50k=yes): 0.116470588235294
P(occupationType=Other_service|makeover50k=yes): .0439********
P(occupationType=Sales|makeover50k=yes): 0.132********0588
P(occupationType=Exec_managerial|makeover50k=yes): 0.247450980392157
P(occupationType=Prof_specialty|makeover50k=yes): 0.232156862745098
P(occupationType=Handlers_cleaners|makeover50k=yes): 0.012156862745098
P(occupationType=Machine_op_inspct|makeover50k=yes): 0.0341176470588235
P(occupationType=Adm_clerical|makeover50k=yes): .0596********
P(occupationType=Farming_fishing|makeover50k=yes): 0.0129411764705882
P(occupationType=Transport_moving|makeover50k=yes):
P(occupationType=Priv_house_serv|makeover50k=yes):
P(occupationType=Protective_serv|makeover50k=yes): 0.0262745098039216
P(occupationType=Armed_Forces|makeover50k=yes):

Probabilities of sexType given makeover50k=yes:

P(sexType=Male|makeover50k=yes): 0.844705882352941
P(sexType=Female|makeover50k=yes): 0.155********7059

Probabilities of ageType given makeover50k=no:

P(ageType=year0_20|makeover50k=no): .0999********
P(ageType=year21_30|makeover50k=no): 0.304544344001955
P(ageType=year31_40|makeover50k=no): 0.422184216955778
P(ageType=year41_50|makeover50k=no): 0.104202296603958
P(ageType=year51_60|makeover50k=no):
P(ageType=yearover60|makeover50k=no): .0691********

Probabilities of educationType given makeover50k=no:

P(educationType=edu1_3|makeover50k=no): 0.0190569264598094
P(educationType=edu4_6|makeover50k=no): 0.0786709015392133
P(educationType=edu7_9|makeover50k=no): 0.412777913510872
P(educationType=edu10_12|makeover50k=no): 0.320180796481798
P(educationType=edu13_14|makeover50k=no): 0.159********3406
P(educationType=edu15_16|makeover50k=no): 0.00989494258490105

Probabilities of occupationType given makeover50k=no:

P(occupationType=Tech_support|makeover50k=no): .024*********
P(occupationType=Craft_repair|makeover50k=no): 0.121915465428781
P(occupationType=Other_service|makeover50k=no): 0.197654532128023
P(occupationType=Sales|makeover50k=no): 0.111531883703885
P(occupationType=Exec_managerial|makeover50k=no): .0814********
P(occupationType=Prof_specialty|makeover50k=no): .0935********
P(occupationType=Handlers_cleaners|makeover50k=no): 0.0487417542145126
P(occupationType=Machine_op_inspct|makeover50k=no): 0.0708526752992915
P(occupationType=Adm_clerical|makeover50k=no): 0.137********8629
P(occupationType=Farming_fishing|makeover50k=no): .0350********
P(occupationType=Transport_moving|makeover50k=no): 0.0522843879794772
P(occupationType=Priv_house_serv|makeover50k=no): 0.00598582946494014
P(occupationType=Protective_serv|makeover50k=no): 0.0186904471048131
P(occupationType=Armed_Forces|makeover50k=no): 0.000244319569997557

Probabilities of sexType given makeover50k=no:

P(sexType=Male
