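Database Cluster Analysis: Translated Foreign-Language Literature

(The document contains the English original alongside its Chinese translation.)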

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

● Set of like elements. Elements from different clusters are not alike.

● The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.
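The idea that one data set can be clustered on different attributes can be sketched in a few lines of Python. The homes, their coordinates and sizes, the dividing line x = 5, the 150-square-metre threshold, and the helper `group_by` are all hypothetical values chosen for illustration; they are not taken from Figure 5.1.

```python
# Hypothetical homes: (name, x, y, size in square metres). All values are made up.
homes = [
    ("A", 1.0, 1.0, 90),
    ("B", 1.2, 0.9, 250),
    ("C", 8.0, 8.1, 95),
    ("D", 8.3, 7.9, 240),
]

def group_by(items, key):
    """Group item names into clusters keyed by the label key(item) returns."""
    clusters = {}
    for item in items:
        clusters.setdefault(key(item), []).append(item[0])
    return clusters

# Clustering 1: by location (west or east of an assumed dividing line x = 5).
by_location = group_by(homes, lambda h: "west" if h[1] < 5 else "east")

# Clustering 2: by size (assumed threshold of 150 square metres).
by_size = group_by(homes, lambda h: "small" if h[3] < 150 else "large")

print(by_location)
print(by_size)
```

The same four homes end up in different groups depending on which attribute drives the grouping, which is why choosing the attributes is part of the clustering problem itself.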

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:

● Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.

● Dynamic data in the database implies that cluster membership may change over time.

● Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.

● There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.

● Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

● The (best) number of clusters is not known.

● There may not be any a priori knowledge concerning the clusters.

● Cluster results are dynamic.

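The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, k. The actual content (and interpretation) of each cluster, $K_j$, $1 \le j \le k$, is determined as a result of the function definition. Without loss of generality, we will view that the result of solving a clustering problem is that a set of clusters is created: $K = \{K_1, K_2, \ldots, K_k\}$.

DEFINITION 5.1. Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an integer value k, the clustering problem is to define a mapping $f: D \rightarrow \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. A cluster $K_j$ contains precisely those tuples mapped to it; that is, $K_j = \{t_i \mid f(t_i) = j,\ 1 \le i \le n,\ \text{and}\ t_i \in D\}$.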
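Definition 5.1 treats a clustering as a mapping f from tuples to cluster numbers, with each cluster holding exactly the tuples mapped to it. A minimal Python sketch of that relationship follows; the function name `clusters_from_mapping`, the toy database, and the threshold rule used for f are assumptions made for illustration only.

```python
from collections import defaultdict

def clusters_from_mapping(D, f, k):
    """Build [K_1, ..., K_k], where K_j holds the tuples t in D with f(t) == j."""
    K = defaultdict(list)
    for t in D:
        j = f(t)
        if not 1 <= j <= k:
            raise ValueError(f"f mapped {t!r} to {j}, outside 1..{k}")
        K[j].append(t)
    # Clusters that received no tuples come back as empty lists.
    return [K[j] for j in range(1, k + 1)]

# Toy database of one-attribute tuples and an assumed mapping f.
D = [1, 2, 3, 10, 11, 12]
f = lambda t: 1 if t < 5 else 2

K = clusters_from_mapping(D, f, k=2)
print(K)
```

Because f is a function, every tuple lands in exactly one cluster, which matches the nonoverlapping clusters assumed by the definition.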

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created. Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed or pruned to fit into memory regardless of the size of the database. Clustering algorithms may also differ based on whether they produce overlapping or nonoverlapping clusters. Even though we consider only nonoverlapping clusters, it is possible to place an item in multiple clusters. In turn, nonoverlapping clusters can be viewed as extrinsic or intrinsic. Extrinsic techniques use labeling of the items to assist in the classification process. These algorithms are the traditional classification supervised learning algorithms in which a special input training set is used. Intrinsic algorithms do not use any a priori category labels, but depend only on the adjacency matrix containing the distance between objects. All algorithms we examine in this chapter fall into the intrinsic class.
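The nested levels of a hierarchical clustering can be made concrete with a small sketch. The code below assumes a bottom-up strategy that repeatedly merges the two closest clusters, measuring closeness by the smallest distance between any two members (often called single link). That merge rule, the function name `single_link_levels`, and the one-dimensional sample points are illustrative choices, not something the text prescribes.

```python
def single_link_levels(points):
    """Return every level of the hierarchy, from n singletons up to one cluster."""
    clusters = [[p] for p in sorted(points)]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        best = None  # (distance, index a, index b) of the closest pair so far
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(x - y) for x in clusters[a] for y in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
        clusters.sort()
        levels.append([list(c) for c in clusters])
    return levels

levels = single_link_levels([1, 2, 6, 7, 20])
for level in levels:
    print(level)
```

The first level has every item in its own cluster and the last has all items together; a partitional algorithm would instead return just one of these sets, chosen by the desired number of clusters.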

The types of clustering algorithms can be further classified based on the implementation technique used. Hierarchical algorithms can be categorized as agglomerative or divisive. "Agglomerative" implies that the clusters are created in a bottom-up fashion, while divisive algorithms work in a top-down fashion. Although both hierarchical and partitional algorithms could be described using the agglomerative vs. divisive label, it typically is more associated with hierarchical algorithms. Another descriptive tag indicates whether each individual element is handled one by one, serial (sometimes called incremental), or whether all items are examined together, simultaneous. If a specific tuple is viewed as having attribute values for all attributes in the schema, then clustering algorithms could differ as to how the attribute values are examined. As is usually done with decision tree classification techniques, some algorithms examine attribute values one at a time, monothetic. Polythetic algorithms consider all attribute values at one time. Finally, clustering algorithms can be labeled based on the mathematical formulation given to the algorithm: graph theoretic or matrix algebra. In this chapter we generally use the graph approach and describe the input to the clustering algorithm as an adjacency matrix labeled with distance measure.
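An adjacency matrix labeled with distances can be built directly from the tuples. The sketch below assumes numeric tuples and Euclidean distance; both the distance choice and the name `distance_matrix` are assumptions for illustration, since the chapter has not yet fixed a particular measure.

```python
import math

def distance_matrix(points):
    """Return an n x n symmetric matrix whose entry [i][j] is dis(points[i], points[j])."""
    n = len(points)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Euclidean distance is an assumed choice of measure.
            m[i][j] = m[j][i] = math.dist(points[i], points[j])
    return m

points = [(0, 0), (3, 4), (6, 8)]
adj = distance_matrix(points)
for row in adj:
    print(row)
```

An intrinsic algorithm in the sense used above would need only this matrix as input, not the original attribute values or any labels.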

We discuss many clustering algorithms in the following sections. This is only a representative subset of the many algorithms that have been proposed in the literature. Before looking at these algorithms, we first examine possible similarity measures and examine the impact of outliers.

5.2 SIMILARITY AND DISTANCE MEASURES

There are many desirable properties for the clusters created by a solution to a specific clustering problem. The most important one is that a tuple within one cluster is more like tuples within that cluster than it is similar to tuples outside it. As with classification, then, we assume the definition of a similarity measure, $sim(t_i, t_l)$, defined between any two tuples, $t_i, t_l \in D$. This provides a more strict and alternative clustering definition, as found in Definition 5.2. Unless otherwise stated, we use the first definition rather than the second. Keep in mind that the similarity relationship stated within the second definition is a desirable, although not always obtainable, property.

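A distance measure, $dis(t_i, t_j)$, as opposed to similarity, is often used in clustering. The clustering problem then has the desirable property that given a cluster, $K_j$, $\forall t_{jl}, t_{jm} \in K_j$ and $t_i \notin K_j$, $dis(t_{jl}, t_{jm}) \le dis(t_{jl}, t_i)$.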
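The property that distances inside a cluster should not exceed distances from a cluster member to any outside tuple can be checked by brute force. The sketch below is only an illustration: the function name `satisfies_distance_property`, the absolute-difference distance, and the sample clusterings are assumptions.

```python
def satisfies_distance_property(clusters, dis):
    """True if, for every cluster, each in-cluster distance dis(a, b) is no larger
    than the distance from a to any tuple outside that cluster."""
    for j, K_j in enumerate(clusters):
        outsiders = [t for i, K in enumerate(clusters) if i != j for t in K]
        for a in K_j:
            for b in K_j:
                for t in outsiders:
                    if dis(a, b) > dis(a, t):
                        return False
    return True

# Assumed distance for one-attribute numeric tuples.
dis = lambda a, b: abs(a - b)

good = satisfies_distance_property([[1, 2, 3], [10, 11, 12]], dis)
bad = satisfies_distance_property([[1, 2, 10], [3, 11, 12]], dis)
print(good, bad)
```

The second clustering fails because 1 and 10 share a cluster even though 1 is much closer to 3, which sits in the other cluster. As the text notes, real data will not always admit a clustering that passes this check.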

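Some clustering algorithms look only at numeric data, usually assuming metric data points. Metric attributes satisfy the triangular inequality. The cluster can then be described by using several characteristic values. Given a cluster, $K_m$ of N points $\{t_{m1}, t_{m2}, \ldots, t_{mN}\}$, we make the following definitions [ZRL96]:

$$\text{centroid} = C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}$$

$$\text{radius} = R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N}}$$

$$\text{diameter} = D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N(N-1)}}$$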

Here the centroid is the "middle" of the cluster; it need not be an actual point in the cluster. Some clustering algorithms alternatively assume that the cluster is represented by one centrally located object in the cluster called a medoid. The radius is the square root of the average mean squared distance from any point in the cluster to the centroid, and the diameter is the square root of the average mean squared distance between all pairs of points in the cluster. We use the notation $M_m$ to indicate the medoid for cluster $K_m$.
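These characteristic values are straightforward to compute for a small numeric cluster. The sketch below makes several assumptions: points are tuples of floats, the diameter averages squared distances over the N(N-1) ordered pairs of distinct points, and the medoid is taken to be the cluster member with the smallest total distance to the other members, which is one common way to make "centrally located" precise. The function names are hypothetical.

```python
import math

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean; need not be a member of the cluster."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def radius(points):
    """Square root of the mean squared distance from the points to the centroid."""
    c = centroid(points)
    return math.sqrt(sum(sq_dist(p, c) for p in points) / len(points))

def diameter(points):
    """Square root of the mean squared distance over ordered pairs of distinct points."""
    n = len(points)
    total = sum(sq_dist(p, q) for p in points for q in points)
    return math.sqrt(total / (n * (n - 1)))

def medoid(points):
    """Member with the smallest total distance to all members (assumed definition)."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0)]
print(centroid(cluster), radius(cluster), diameter(cluster), medoid(cluster))
```

In this toy cluster the centroid and the medoid coincide at (1.0, 0.0); dropping that point from the cluster would leave the centroid unchanged while the medoid would have to be one of the two remaining members.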

Many clustering algorithms require that the distance between clusters (rather than elements)
