Data Mining Lab Report (数据挖掘实验报告.pdf)

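Harbin Institute of Technology

Data Mining Theory and Algorithms: Lab Report
(Fall Semester, 2015)

Course code: S1300019C
Instructor: 邹兆年
Student name: 谢浩哲
Student ID: 15S103172
School: School of Computer Science and Technology

Harbin Institute of Technology | Page 1 of 10 | Designed by 谢浩哲

NOTE:

All code covered in this report is open source on GitHub:

https:/

I. Experiment Content

NOTE: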

The implementation idea of each algorithm is described in the next section.

1. K-Means

K-means clustering is a method of vector quantization, originally from signal processing, that is popular for cluster analysis in data mining. K-means clustering aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.

2. AGNES (hierarchical clustering)

AGNES is also known as agglomerative hierarchical clustering. The algorithm groups the data one by one on the basis of the nearest distance among all pairwise distances between data points. After each merge, the distances between data points are recalculated, but which distance should be used once groups have been formed? Many methods are available for this. Some of them are:

- Single: nearest distance, or single linkage
- Complete: farthest distance, or complete linkage
- Average: average distance, or average linkage
- Centroid distance
- Ward's method: the sum of squared Euclidean distances is minimized

3. DBSCAN

Density-based spatial clustering of applications with noise (DBSCAN) is a density-based data clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.

II. Experiment Design

1. K-Means

Algorithm idea:

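Arbitrarily select k points from the point set as the initial centers. Compare each point against the k centers and assign it to the cluster whose center is nearest. Once every point has been assigned, recompute the center of each cluster. Repeat this process until the centers no longer change.

[Figure: program flowchart]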
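The iteration described above can be sketched in a few lines of Java. This is a minimal illustration, not the report's implementation: the class name `KMeansSketch`, the `double[]` point representation, the choice of the first k points as initial centers, and the rule that an empty cluster keeps its previous center are all assumptions made for the example.

```java
import java.util.Arrays;

class KMeansSketch {
    /** Returns the index of the centroid closest to p (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; ++c) {
            double dx = p[0] - centroids[c][0];
            double dy = p[1] - centroids[c][1];
            double dist = dx * dx + dy * dy;
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Runs k-means on 2-D points and returns the final centroids. */
    static double[][] cluster(double[][] points, int k) {
        if (k <= 0 || k > points.length) {
            throw new IllegalArgumentException("k must be in [1, points.length]");
        }
        // Assumption: the first k points serve as the initial centers.
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; ++c) {
            centroids[c] = points[c].clone();
        }
        while (true) {
            // Assignment step: accumulate coordinate sums per nearest centroid.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centroids);
                sums[c][0] += p[0];
                sums[c][1] += p[1];
                ++counts[c];
            }
            // Update step: each centroid becomes the mean of its points.
            double[][] updated = new double[k][];
            for (int c = 0; c < k; ++c) {
                // Assumption: an empty cluster keeps its previous center.
                updated[c] = counts[c] == 0
                        ? centroids[c].clone()
                        : new double[] {sums[c][0] / counts[c], sums[c][1] / counts[c]};
            }
            // Stop once the centers no longer change.
            if (Arrays.deepEquals(updated, centroids)) {
                return updated;
            }
            centroids = updated;
        }
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        System.out.println(Arrays.deepToString(cluster(points, 2)));
    }
}
```

Taking the first k points as seeds keeps the sketch deterministic; a random choice, as the description allows, would work the same way but may converge to a different partition.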

Core code:

    // Point, Cluster, and the helpers getInitialClusters, isClustersTheSame,
    // and getClosestClusters are not shown in this excerpt.
    public class KMeans {
        public Cluster[] getClusters(int k, Point[] points) {
            if (k >= points.length) {
                return null;
            }

            Cluster[] clusters = getInitialClusters(k, points);
            Cluster[] newClusters = null;
            do {
                newClusters = getClusters(k, points, clusters);

                // Stop once an iteration leaves the clusters unchanged.
                if (isClustersTheSame(clusters, newClusters)) {
                    break;
                }

                clusters = newClusters;
            } while (true);
            return clusters;
        }

        private Cluster[] getClusters(int k, Point[] points, Cluster[] cluster) {
            // Assign every point to its closest cluster.
            for (int i = 0; i < points.length; ++i) {
                Point currentPoint = points[i];
                Cluster c = getClosestClusters(currentPoint, cluster);
                c.points.add(currentPoint);
            }

            // Recompute the centroid of every cluster.
            Cluster[] newClusters = new Cluster[k];
            for (int i = 0; i < k; ++i) {
                Cluster c = cluster[i];
                int numberOfPointsInCluster = c.points.size();

                if (numberOfPointsInCluster == 0) {
                    // If the cluster is empty
                    int randomIndex = (int) (Math.random() * points.length);
                    newClusters[i] = new Cluster(points[randomIndex]);
                } else {
                    // If the cluster is not empty
                    double newCentroidX = 0;
                    double newCentroidY = 0;
                    for (int j = 0; j < numberOfPointsInCluster; ++j) {
                        Point p = c.points.get(j);
                        newCentroidX += p.x;
                        newCentroidY += p.y;
                    }
                    newCentroidX /= numberOfPointsInCluster;
                    newCentroidY /= numberOfPointsInCluster;

                    Cluster newCluster = new Cluster(new Point(newCentroidX, newCentroidY));
                    newClusters[i] = newCluster;
                }
            }
            return newClusters;
        }
    }

2. AGNES (hierarchical clustering)

Algorithm idea:

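The algorithm uses Group Average as the merge criterion. In the first iteration, the pair with the smallest Group Average among the n points is merged; the merged cluster is added to the list, the two original clusters are removed, and the Group Average between the new cluster and the other n - 2 clusters is recomputed. These steps are repeated until all clusters have been merged into one.

[Figure: program flowchart]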
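The merge loop described above can be sketched as follows. This is a hedged illustration, not the report's code: the class name `AgnesSketch`, the flat `List<double[]>` cluster representation, and the `targetClusters` stopping parameter are assumptions introduced for the example (passing 1 merges everything into a single cluster, as the description does). Group Average is taken here to be the mean Euclidean distance over all cross-cluster point pairs.

```java
import java.util.ArrayList;
import java.util.List;

class AgnesSketch {
    /** Group average: mean Euclidean distance over all cross-cluster point pairs. */
    static double groupAverage(List<double[]> a, List<double[]> b) {
        double sum = 0;
        for (double[] p : a) {
            for (double[] q : b) {
                sum += Math.hypot(p[0] - q[0], p[1] - q[1]);
            }
        }
        return sum / (a.size() * b.size());
    }

    /** Merges the closest pair of clusters until only targetClusters remain. */
    static List<List<double[]>> agglomerate(double[][] points, int targetClusters) {
        // Every point starts as its own cluster.
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : points) {
            List<double[]> single = new ArrayList<>();
            single.add(p);
            clusters.add(single);
        }
        while (clusters.size() > Math.max(targetClusters, 1)) {
            double minProximity = Double.MAX_VALUE;
            int first = 0;
            int second = 1;
            for (int i = 0; i < clusters.size(); ++i) {
                for (int j = i + 1; j < clusters.size(); ++j) {
                    double proximity = groupAverage(clusters.get(i), clusters.get(j));
                    if (proximity < minProximity) {
                        minProximity = proximity;
                        first = i;
                        second = j;
                    }
                }
            }
            List<double[]> merged = new ArrayList<>(clusters.get(first));
            merged.addAll(clusters.get(second));
            // Remove the larger index first so the smaller index stays valid.
            clusters.remove(second);
            clusters.remove(first);
            clusters.add(merged);
        }
        return clusters;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {5, 5}, {5, 6}, {20, 20}};
        for (List<double[]> cluster : agglomerate(points, 2)) {
            System.out.println(cluster.size() + " point(s)");
        }
    }
}
```

Each pass recomputes every pairwise proximity, which is simple but costs a full rescan per merge; caching the proximities of untouched pairs would be the usual refinement.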

Core code:

    import java.util.List;

    // Cluster and getProximity are not shown in this excerpt.
    public class Agnes {
        public Cluster getCluster(List<Cluster> clusters) {
            while (clusters.size() > 1) {
                double minProximity = Double.MAX_VALUE;
                int minProximityIndex1 = 0, minProximityIndex2 = 0;

                // Find the pair of clusters with the smallest proximity.
                for (int i = 0; i < clusters.size(); ++i) {
                    for (int j = i + 1; j < clusters.size(); ++j) {
                        double proximity = getProximity(clusters.get(i), clusters.get(j));

                        if (proximity < minProximity) {
                            minProximity = proximity;
                            minProximityIndex1 = i;
                            minProximityIndex2 = j;
                        }
                    }
                }

                // Merge the pair; remove the larger index first so the
                // smaller index stays valid.
                Cluster c = new Cluster(clusters.get(minProximityIndex1), clusters.get(minProximityIndex2));
                clusters.add(c);
                clusters.remove(minProximityIndex2);
                clusters.remove(minProximityIndex1);
            }
            return clusters.size() == 0 ? null : clusters.get(0);
        }
    }

3. DBSCAN

Algorithm idea:

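First, identify the Core Points among all points by counting the points within each point's neighborhood. Then identify the Border Points among the remaining points, that is, the points that lie within the neighborhood of a Core Point. Next, two Core Points that are connected to each other belong to the same cluster, so all Core Points are merged into a number of clusters. Finally, check each Border Point to find which Core Point's neighborhood it lies in, and merge it into that Core Point's cluster.

[Figure: program flowchart]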
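The three passes described above (find Core Points, connect Core Points into clusters, attach Border Points) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the report's implementation: the class name `DbscanSketch`, the integer label array, and the `NOISE` constant are hypothetical, a point is assumed to count itself among its own neighbors, and a Border Point is assumed to join the cluster of the first Core Point found in its neighborhood.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

class DbscanSketch {
    static final int NOISE = -1;

    static boolean isNeighbor(double[] p, double[] q, double eps) {
        return Math.hypot(p[0] - q[0], p[1] - q[1]) <= eps;
    }

    /** Returns one cluster label per point; NOISE marks outliers. */
    static int[] cluster(double[][] points, int minPoints, double eps) {
        int n = points.length;

        // Pass 1: a Core Point has at least minPoints neighbors within eps
        // (assumption: the point itself is included in the count).
        boolean[] core = new boolean[n];
        for (int i = 0; i < n; ++i) {
            int count = 0;
            for (int j = 0; j < n; ++j) {
                if (isNeighbor(points[i], points[j], eps)) {
                    ++count;
                }
            }
            core[i] = count >= minPoints;
        }

        // Pass 2: Core Points within eps of each other share a cluster.
        int[] labels = new int[n];
        Arrays.fill(labels, NOISE);
        int nextLabel = 0;
        for (int i = 0; i < n; ++i) {
            if (!core[i] || labels[i] != NOISE) {
                continue;
            }
            labels[i] = nextLabel;
            Deque<Integer> queue = new ArrayDeque<>();
            queue.add(i);
            while (!queue.isEmpty()) {
                int current = queue.poll();
                for (int j = 0; j < n; ++j) {
                    if (core[j] && labels[j] == NOISE
                            && isNeighbor(points[current], points[j], eps)) {
                        labels[j] = nextLabel;
                        queue.add(j);
                    }
                }
            }
            ++nextLabel;
        }

        // Pass 3: a Border Point joins the cluster of the first Core Point
        // whose neighborhood contains it; all other points stay NOISE.
        for (int i = 0; i < n; ++i) {
            if (core[i]) {
                continue;
            }
            for (int j = 0; j < n; ++j) {
                if (core[j] && isNeighbor(points[i], points[j], eps)) {
                    labels[i] = labels[j];
                    break;
                }
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0, 0}, {0, 1}, {1, 0}, {1, 1}, {2, 1},
            {10, 10}, {10, 11}, {11, 10}, {11, 11}, {50, 50}
        };
        System.out.println(Arrays.toString(cluster(points, 3, 1.1)));
    }
}
```

In the sample data, (2, 1) has too few neighbors to be a Core Point but lies within eps of the Core Point (1, 1), so it is attached as a Border Point, while (50, 50) has no Core Point nearby and remains noise.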

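Core code:

The core code of the algorithm is implemented as follows (only the parts that identify Core Points and group Core Points into clusters are included):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Point and the helpers getClustersOfCorePoints, getBorderPoints, and
    // getClustersOfBorderPoints are not shown in this excerpt. The type
    // parameters of the Map and of the returned List are not recoverable
    // from the source text, so those two are left as raw types.
    public class Dbscan {
        public List getClusters(List<Point> points, int minPoints, double eps) {
            // Identify the Core Points and group them into clusters.
            List<Point> corePoints = getCorePoints(points, minPoints, eps);
            Map clusters = getClustersOfCorePoints(corePoints, eps);

            // Attach each Border Point to the cluster of a nearby Core Point.
            List<Point> borderPoints = getBorderPoints(points, corePoints, minPoints, eps);
            getClustersOfBorderPoints(corePoints, borderPoints, clusters, eps);

            return new ArrayList(clusters.values());
        }

        private List<Point> getCorePoints(List<Point> points, int minPoints, double eps) {
            List<Point> corePoints = new ArrayList<Point>();

            for (int i = 0; i < points.size(); ++i) {
                Point currentPoint = points.get(i);
                // (listing truncated in the source)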
