Data mining

From Wikipedia, the free encyclopedia

Not to be confused with analytics, information extraction, machine learning, or data analysis.

Data mining (the analysis step of the Knowledge Discovery in Databases process,[1] or KDD), a relatively young and interdisciplinary field of computer science,[2][3] is the process of discovering new patterns from large data sets involving methods from statistics and artificial intelligence but also database management. In contrast to machine learning, the emphasis lies on the discovery of previously unknown patterns as opposed to generalizing known patterns to new data.

The term is a buzzword, and is frequently misused to mean any form of large-scale data or information processing (collection, extraction, warehousing, analysis and statistics) but also generalized to any kind of computer decision support system including artificial intelligence, machine learning and business intelligence. In the proper use of the word, the key term is discovery, commonly defined as "detecting something new". Even the popular book "Data mining: Practical machine learning tools and techniques with Java"[4] (which covers mostly machine learning material) was originally to be named just "Practical machine learning", and the term "data mining" was only added for marketing reasons.[5] Often the more general terms "(large scale) data analysis" or "analytics" are more appropriate.

The actual data mining task is the automatic or semi-automatic analysis of large quantities of data in order to extract previously unknown interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection) and dependencies (association rule mining). These patterns can then be seen as a kind of summary of the input data, and used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Data collection, data preparation, result interpretation and reporting are not part of the data mining step, but they do belong to the overall data mining process as additional steps.

The related terms data dredging, data fishing and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.

Contents

∙ 1 Background
    o 1.1 Research and evolution
∙ 2 Process
    o 2.1 Pre-processing
    o 2.2 Data mining
    o 2.3 Results validation
∙ 3 Standards
∙ 4 Notable uses
    o 4.1 Games
    o 4.2 Business
    o 4.3 Science and engineering
    o 4.4 Spatial data mining
        ▪ 4.4.1 Challenges
    o 4.5 Visual data mining
    o 4.6 Surveillance
        ▪ 4.6.1 Pattern mining
        ▪ 4.6.2 Subject-based data mining
∙ 5 Privacy concerns and ethics
∙ 6 Marketplace surveys
∙ 7 Groups and associations
∙ 8 See also
    o 8.1 Methods
    o 8.2 Application domains
    o 8.3 Application examples
    o 8.4 Miscellaneous
    o 8.5 Related topics
    o 8.6 Commercial data-mining software and applications
    o 8.7 Free libre open source data-mining software and applications
∙ 9 References
∙ 10 Further reading
∙ 11 External links

Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data include Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns.[6] It has been used for many years by businesses, scientists and governments to sift through volumes of data such as airline passenger trip records, census data and supermarket scanner data to produce market research reports. (Note, however, that reporting is not always considered to be data mining.)

A primary reason for using data mining is to assist in the analysis of collections of observations of behavior. Such data are vulnerable to collinearity because of unknown interrelations. An unavoidable fact of data mining is that the (sub-)set(s) of data being analyzed may not be representative of the whole domain, and therefore may not contain examples of certain critical relationships and behaviors that exist across other parts of the domain. To address this sort of issue, the analysis may be augmented using experiment-based and other approaches, such as choice modelling for human-generated data. In these situations, inherent correlations can be either controlled for, or removed altogether, during the construction of the experimental design.

Research and evolution

The premier professional body in the field is the Association for Computing Machinery's Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD). Since 1989 they have hosted an annual international conference and published its proceedings,[7] and since 1999 have published a biannual academic journal titled "SIGKDD Explorations".[8]

Computer science conferences on data mining include:

∙ CIKM – ACM Conference on Information and Knowledge Management
∙ DMIN – International Conference on Data Mining
∙ DMKD – Research Issues on Data Mining and Knowledge Discovery
∙ ECDM – European Conference on Data Mining
∙ ECML-PKDD – European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
∙ EDM – International Conference on Educational Data Mining
∙ ICDM – IEEE International Conference on Data Mining
∙ KDD – ACM SIGKDD Conference on Knowledge Discovery and Data Mining
∙ MLDM – Machine Learning and Data Mining in Pattern Recognition
∙ PAKDD – The annual Pacific-Asia Conference on Knowledge Discovery and Data Mining
∙ PAW – Predictive Analytics World
∙ SDM – SIAM International Conference on Data Mining (SIAM)
∙ SSTD – Symposium on Spatial and Temporal Databases

Data mining topics are also present at most data management and database conferences.

Process

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages (1) Selection, (2) Preprocessing, (3) Transformation, (4) Data Mining, and (5) Interpretation/Evaluation.[1] It exists, however, in many variations of this theme, such as the Cross Industry Standard Process for Data Mining (CRISP-DM), which defines six phases: (1) Business Understanding, (2) Data Understanding, (3) Data Preparation, (4) Modeling, (5) Evaluation, and (6) Deployment; or a simplified process such as (1) Pre-processing, (2) Data mining, and (3) Results validation.

Pre-processing

Before data mining algorithms can be used, a target data set must be assembled. As data mining can only uncover patterns actually present in the data, the target data set must be large enough to contain these patterns while remaining concise enough to be mined in an acceptable timeframe. A common source for data is a data mart or data warehouse. Pre-processing is essential for analysing multivariate data sets before data mining.

The target set is then cleaned. Data cleaning removes the observations with noise and missing data.
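
As a rough illustration of the cleaning step, the following Python sketch drops observations that have missing values or that look like noise. The record layout, the function name `clean`, and the rule used to flag noise (a value more than three median absolute deviations from its column median) are assumptions made for this example only; the text does not prescribe a particular method.

```python
from statistics import median

def clean(records, max_dev=3.0):
    """Return the records that have no missing values and no extreme values.

    records: list of tuples of numbers, with None marking a missing value.
    max_dev: hypothetical noise threshold, in median absolute deviations.
    """
    # Step 1: drop observations with missing data.
    complete = [r for r in records if all(v is not None for v in r)]
    if not complete:
        return []

    # Step 2: per column, compute the median and the median absolute deviation.
    columns = list(zip(*complete))
    centers = [median(col) for col in columns]
    spreads = [median(abs(v - c) for v in col)
               for col, c in zip(columns, centers)]

    def is_noisy(record):
        for value, center, spread in zip(record, centers, spreads):
            # A zero spread gives no scale to compare against, so skip the check.
            if spread > 0 and abs(value - center) / spread > max_dev:
                return True
        return False

    # Step 3: drop observations flagged as noise.
    return [r for r in complete if not is_noisy(r)]

# Invented sample data: one record has a missing value, one has an extreme value.
raw = [(1.0, 10.0), (1.2, 11.0), (None, 10.5), (0.9, 9.5), (1.1, 500.0), (1.0, 10.2)]
cleaned = clean(raw)
```

Here `cleaned` keeps four of the six records: the one containing `None` and the one containing `500.0` are discarded. In practice the threshold and the definition of noise depend on the data set.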

Data mining

Data mining involves six common classes of tasks:[1]

∙ Anomaly detection (outlier/change/deviation detection) – The identification of unusual data records, which might be interesting findings or data errors that require further investigation.
∙ Association rule learning (dependency modeling) – Searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
∙ Clustering – The task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
∙ Classification – The task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam.
∙ Regression – Attempts to find a function which models the data with the least error.
∙ Summarization – Providing a more compact representation of the data set, including visualization and report generation.
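
The supermarket example for association rule learning can be sketched in a few lines of Python. This is a minimal illustration restricted to rules between pairs of single items; the function name `pair_rules`, the thresholds and the sample baskets are all invented for the example. Here the support of a pair is the fraction of baskets containing both items, and the confidence of a rule A -> B is the fraction of baskets containing A that also contain B.

```python
from collections import Counter
from itertools import combinations

def pair_rules(baskets, min_support=0.5, min_confidence=0.7):
    """Find rules 'A -> B' between single items that are often bought together."""
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))          # ignore duplicates within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue                         # the pair is not frequent enough
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

# Hypothetical purchase records, one list per customer visit.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["bread", "milk"],
    ["butter", "bread", "jam"],
    ["milk", "jam"],
]
rules = pair_rules(baskets)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

With these baskets only bread and butter occur together often enough, giving the two rules bread -> butter and butter -> bread. Practical association rule miners handle larger item sets and prune the search space; this sketch only shows the two measures involved.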

Results validation

The final step of knowledge discovery from data is to verify that the patterns produced by the data mining algorithms occur in the wider data set. Not all patterns found by the data mining algorithms are necessarily valid. It is common for the data mining algorithms to find patterns in the training set which are not present in the general data set. This is called overfitting. To overcome this, the evaluation uses a test set of data on which the data mining algorithm was not trained. The learned patterns are applied to this test set and the resulting output is compared to the desired output. For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails.
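
The difference between performance on the training set and on a held-out test set can be shown with a deliberately naive spam filter. Everything in this Python sketch is hypothetical: the functions `train`, `predict` and `accuracy`, the learning rule (memorize words seen only in spam) and the eight toy emails were invented to illustrate overfitting, not taken from any real system.

```python
def train(emails):
    """Learn the set of words seen in spam but never in legitimate training emails."""
    spam_words, ham_words = set(), set()
    for text, is_spam in emails:
        (spam_words if is_spam else ham_words).update(text.lower().split())
    return spam_words - ham_words

def predict(spam_only_words, text):
    """Flag an email as spam if it contains any word learned as spam-only."""
    return any(word in spam_only_words for word in text.lower().split())

def accuracy(spam_only_words, emails):
    """Fraction of emails whose predicted label matches the desired label."""
    correct = sum(predict(spam_only_words, text) == is_spam
                  for text, is_spam in emails)
    return correct / len(emails)

# Each entry is (email text, desired output), where True means spam.
training_set = [
    ("win a free prize now", True),
    ("cheap pills free shipping", True),
    ("meeting agenda for monday", False),
    ("lunch on friday", False),
]
# The test set is kept apart and never passed to train().
test_set = [
    ("free prize inside", True),
    ("agenda for the free workshop", False),
    ("claim your reward today", True),
    ("notes from monday meeting", False),
]

model = train(training_set)
train_accuracy = accuracy(model, training_set)
test_accuracy = accuracy(model, test_set)
print(train_accuracy, test_accuracy)
```

The filter labels every training email correctly but only half of the test emails, because the word patterns it memorized do not carry over to unseen messages. Comparing the two figures is what reveals the overfitting.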
