Data mining
From Wikipedia, the free encyclopedia
Not to be confused with analytics, information extraction, machine learning, or data analysis.
Data mining (the analysis step of the Knowledge Discovery in Databases process,[1] or KDD), a relatively young and interdisciplinary field of computer science,[2][3] is the process of discovering new patterns from large data sets using methods from statistics, artificial intelligence and database management. In contrast to machine learning, the emphasis lies on the discovery of previously unknown patterns as opposed to generalizing known patterns to new data.
The term is a buzzword, and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis and statistics). It is also generalized to any kind of computer decision support system, including artificial intelligence, machine learning and business intelligence. In the proper use of the word, the key term is discovery, commonly defined as "detecting something new". Even the popular book "Data mining: Practical machine learning tools and techniques with Java"[4] (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons.[5] Often the more general terms "(large scale) data analysis" or "analytics" are more appropriate.
The actual data mining task is the automatic or semi-automatic analysis of large quantities of data in order to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). These patterns can then be seen as a kind of summary of the input data, and used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, and result interpretation and reporting are not part of the data mining step, but they do belong to the overall data mining process as additional steps.
The related terms data dredging, data fishing and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
Contents
∙ 1 Background
o 1.1 Research and evolution
∙ 2 Process
o 2.1 Pre-processing
o 2.2 Data mining
o 2.3 Results validation
∙ 3 Standards
∙ 4 Notable uses
o 4.1 Games
o 4.2 Business
o 4.3 Science and engineering
o 4.4 Spatial data mining
▪ 4.4.1 Challenges
o 4.5 Visual Data Mining
o 4.6 Surveillance
▪ 4.6.1 Pattern mining
▪ 4.6.2 Subject-based data mining
∙ 5 Privacy concerns and ethics
∙ 6 Marketplace surveys
∙ 7 Groups and associations
∙ 8 See also
o 8.1 Methods
o 8.2 Application domains
o 8.3 Application examples
o 8.4 Miscellaneous
o 8.5 Related topics
o 8.6 Commercial data-mining software and applications
o 8.7 Free libre open-source data-mining software and applications
∙ 9 References
∙ 10 Further reading
∙ 11 External links
Background
The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[6] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)
A primary reason for using data mining is to assist in the analysis of collections of observations of behavior. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analyzed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviors that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.
Research and evolution
The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 it has hosted an annual international conference and published its proceedings,[7] and since 1999 it has published a biannual academic journal titled "SIGKDD Explorations".[8]
Computer science conferences on data mining include:
∙ CIKM - ACM Conference on Information and Knowledge Management
∙ DMIN - International Conference on Data Mining
∙ DMKD - Research Issues on Data Mining and Knowledge Discovery
∙ ECDM - European Conference on Data Mining
∙ ECML-PKDD - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
∙ EDM - International Conference on Educational Data Mining
∙ ICDM - IEEE International Conference on Data Mining
∙ KDD - ACM SIGKDD Conference on Knowledge Discovery and Data Mining
∙ MLDM - Machine Learning and Data Mining in Pattern Recognition
∙ PAKDD - The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
∙ PAW - Predictive Analytics World
∙ SDM - SIAM International Conference on Data Mining (SIAM)
∙ SSTD - Symposium on Spatial and Temporal Databases
Data mining topics are also present at most data management and database conferences.
Process
The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages (1) Selection, (2) Preprocessing, (3) Transformation, (4) Data Mining, and (5) Interpretation/Evaluation.[1] Many variations on this theme exist, however, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Modeling, (5) Evaluation, and (6) Deployment; or a simplified process such as (1) Pre-processing, (2) Data mining, and (3) Results validation.
Pre-processing
Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable time frame. A common source for data is a data mart or data warehouse. Pre-processing is essential to analyse the multivariate data sets before data mining.
The target set is then cleaned. Data cleaning removes the observations with noise and missing data.
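As an illustration of the cleaning step, the following is a minimal sketch in Python. It assumes that records are plain dictionaries, and it treats "noise" simply as a value outside a plausible range supplied by the analyst. The function name `clean_records`, the field name `age` and the range 0 to 120 are hypothetical choices made for this example; real cleaning rules depend on the data set.

```python
import math


def clean_records(records, field, low, high):
    """Return only the records whose `field` is present, numeric and in [low, high].

    Hypothetical helper: "noise" is modelled here as an out-of-range value,
    which is only one of many possible definitions.
    """
    cleaned = []
    for record in records:
        value = record.get(field)
        # Missing data: the field is absent, None, or not a number at all.
        if not isinstance(value, (int, float)):
            continue
        # Missing data encoded as NaN.
        if math.isnan(value):
            continue
        # Noise: the value lies outside the plausible range.
        if not (low <= value <= high):
            continue
        cleaned.append(record)
    return cleaned


raw = [
    {"age": 34},
    {"age": None},          # missing value
    {"age": 290},           # implausible value, treated as noise
    {},                     # field absent
    {"age": 51},
    {"age": float("nan")},  # missing value encoded as NaN
]
cleaned = clean_records(raw, "age", 0, 120)
print(cleaned)
```

Dropping whole records is the simplest policy and matches the description above; other policies, such as imputing missing values, are possible but are not shown here.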
Data mining
Data mining involves six common classes of tasks:[1]
∙ Anomaly detection (outlier/change/deviation detection) - The identification of unusual data records that might be interesting, or of data errors that require further investigation.
∙ Association rule learning (dependency modeling) - Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
∙ Clustering - The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
∙ Classification - The task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
∙ Regression - Attempts to find a function which models the data with the least error.
∙ Summarization - Providing a more compact representation of the data set, including visualization and report generation.
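The supermarket example under association rule learning can be sketched as follows. This is a simplified illustration that only counts pairs of items and reports two standard measures, support (the fraction of baskets containing both items) and confidence (the fraction of baskets containing the first item that also contain the second). It is not a full association rule algorithm such as Apriori, and the function name `pair_rules`, the sample baskets and the 0.5 support threshold are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations


def pair_rules(baskets, min_support=0.5):
    """Map (antecedent, consequent) to (support, confidence) for frequent item pairs."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # de-duplicate and fix the order of each pair
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue  # the pair is not bought together often enough
        # Each frequent pair yields one rule per direction.
        rules[(a, b)] = (support, count / item_counts[a])
        rules[(b, a)] = (support, count / item_counts[b])
    return rules


baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["bread", "milk"],
    ["beer"],
]
rules = pair_rules(baskets, min_support=0.5)
for (antecedent, consequent), (support, confidence) in sorted(rules.items()):
    print(f"{antecedent} -> {consequent}: support={support:.2f}, confidence={confidence:.2f}")
```

In this toy data every basket that contains butter also contains bread, so the rule "butter -> bread" has a confidence of 1.0, while "bread -> butter" is weaker because bread also appears without butter.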
Results validation
The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set