Data Mining Tutorial
Author:
Seth Paul, Jamie MacLennan, Zhaohui Tang, Scott Oveson
Abstract:
Microsoft® SQL Server™ 2005 provides an integrated environment for creating and working with data mining models. This tutorial uses four scenarios, targeted mailing, forecasting, market basket, and sequence clustering, to demonstrate how to use the mining model algorithms, mining model viewers, and data mining tools that are included in this release of SQL Server.
The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.
This white paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS OR IMPLIED, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
© 2003 Microsoft Corporation. All rights reserved.
Microsoft is either a registered trademark or a trademark of Microsoft Corporation in the United States and/or other countries.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
Introduction
The data mining tutorial is designed to walk you through the process of creating data mining models in Microsoft SQL Server 2005. The data mining algorithms and tools in SQL Server 2005 make it easy to build a comprehensive solution for a variety of projects, including market basket analysis, forecasting analysis, and targeted mailing analysis. The scenarios for these solutions are explained in greater detail later in the tutorial.
The most visible components in SQL Server 2005 are the workspaces that you use to create and work with data mining models. The online analytical processing (OLAP) and data mining tools are consolidated into two working environments: Business Intelligence Development Studio and SQL Server Management Studio. Using Business Intelligence Development Studio, you can develop an Analysis Services project disconnected from the server. When the project is ready, you can deploy it to the server. You can also work directly against the server. The main function of SQL Server Management Studio is to manage the server. Each environment is described in more detail later in this introduction. For more information on choosing between the two environments, see "Choosing Between SQL Server Management Studio and Business Intelligence Development Studio" in SQL Server Books Online.
All of the data mining tools exist in the data mining editor. Using the editor, you can manage mining models, create new models, view models, compare models, and create predictions based on existing models.
After you build a mining model, you will want to explore it, looking for interesting patterns and rules. Each mining model viewer in the editor is customized to explore models built with a specific algorithm. For more information about the viewers, see "Viewing a Data Mining Model" in SQL Server Books Online.
Often your project will contain several mining models, so before you can use a model to create predictions, you need to be able to determine which model is the most accurate. For this reason, the editor contains a model comparison tool called the Mining Accuracy Chart tab. Using this tool, you can compare the predictive accuracy of your models and determine the best model.
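The idea behind comparing models can be illustrated, in a much simplified form, outside SQL Server. The following Python sketch is not how the Mining Accuracy Chart tab is implemented. It assumes that each model has already produced one predicted label per test case, and the model names and data are hypothetical. It scores each model by the fraction of correct predictions and returns the name of the highest-scoring one.

```python
from typing import Dict, List, Sequence


def accuracy(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Return the fraction of cases where the prediction matches the actual value."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("at least one test case is required")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def best_model(predictions: Dict[str, List[int]], actual: Sequence[int]) -> str:
    """Return the name of the model with the highest accuracy on the test cases."""
    scores = {name: accuracy(preds, actual) for name, preds in predictions.items()}
    return max(scores, key=scores.get)


# Hypothetical held-out test cases: 1 = bought a bike, 0 = did not.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "decision_trees": [1, 0, 1, 0, 0, 0, 1, 0],  # 7 of 8 correct
    "naive_bayes":    [1, 1, 1, 0, 0, 1, 1, 0],  # 5 of 8 correct
}

print(best_model(predictions, actual))
```

A single accuracy figure is only one way to rank models; it is used here because it is the simplest measure to compute by hand.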
To create predictions, you will use the Data Mining Extensions (DMX) language. DMX extends SQL, containing commands to create, modify, and predict against mining models. For more information about DMX, see "Data Mining Extensions (DMX) Reference" in SQL Server Books Online. Because creating a prediction can be complicated, the data mining editor contains a tool called Prediction Query Builder, which allows you to build queries using a graphical interface. You can also view the DMX code that is generated by the query builder.
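To give a sense of the kind of DMX statement involved in a prediction, the following Python sketch assembles a prediction join as text. The model name, data source, table, and column names are hypothetical placeholders, not names taken from this tutorial, and the output of Prediction Query Builder may differ in form. The function only builds a string; it does not connect to Analysis Services.

```python
from typing import Sequence


def build_prediction_query(
    model: str,
    data_source: str,
    source_sql: str,
    key_column: str,
    input_columns: Sequence[str],
    target_column: str,
) -> str:
    """Assemble a DMX PREDICTION JOIN statement as plain text."""
    if not input_columns:
        raise ValueError("at least one input column is required")
    # Map each model input column to the same-named column of the source rowset.
    on_clause = " AND\n    ".join(
        f"[{model}].[{col}] = t.[{col}]" for col in input_columns
    )
    # Single quotes inside the embedded SQL text must be doubled.
    escaped_sql = source_sql.replace("'", "''")
    return (
        f"SELECT t.[{key_column}], Predict([{target_column}]) AS [Predicted]\n"
        f"FROM [{model}]\n"
        f"PREDICTION JOIN\n"
        f"  OPENQUERY([{data_source}], '{escaped_sql}') AS t\n"
        f"ON  {on_clause}"
    )


# All names below are hypothetical and exist only for illustration.
query = build_prediction_query(
    model="Example Model",
    data_source="Example Source",
    source_sql="SELECT CustomerKey, Age, Gender FROM ProspectiveBuyer",
    key_column="CustomerKey",
    input_columns=["Age", "Gender"],
    target_column="Bike Buyer",
)
print(query)
```

The ON clause pairs each model column with a column of the input rowset, which is the part of a prediction query that a graphical builder is most helpful for when a model has many inputs.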
Just as important as the tools that you use to work with and create data mining models are the mechanics by which they are created. The key to creating a mining model is the data mining algorithm. The algorithm finds patterns in the data that you pass it, and it translates them into a mining model; it is the engine behind the process. SQL Server 2005 includes nine algorithms:
∙ Microsoft Decision Trees
∙ Microsoft Clustering
∙ Microsoft Naïve Bayes
∙ Microsoft Sequence Clustering
∙ Microsoft Time Series
∙ Microsoft Association
∙ Microsoft Neural Network
∙ Microsoft Linear Regression
∙ Microsoft Logistic Regression
Using a combination of these nine algorithms, you can create solutions to common business problems. These algorithms are described in more detail later in this tutorial.
Some of the most important steps in creating a data mining solution are consolidating, cleaning, and preparing the data to be used to create the mining models. SQL Server 2005 includes the Data Transformation Services (DTS) working environment, which contains tools that you can use to clean, validate, and prepare your data. For more information on using DTS in conjunction with a data mining solution, see "DTS Data Mining Tasks and Transformations" in SQL Server Books Online.
In order to demonstrate the SQL Server data mining features, this tutorial uses a new sample database called AdventureWorksDW. The database is included with SQL Server 2005, and it supports OLAP and data mining functionality. In order to make the sample database available, you need to select the sample database at the installation time in the “Advanced” dialog for component selection.
The audience for this tutorial is business analysts, developers, and database administrators who have used data mining tools before and are familiar with data mining concepts. If you are new to data mining, download "Preparing and Mining Data with Microsoft SQL Server 2000 and Analysis Services" (
Adventure Works
AdventureWorksDW is based on a fictional bicycle manufacturing company named Adventure Works Cycles. Adventure Works produces and distributes metal and composite bicycles to North American, European, and Asian commercial markets. The base of operations is located in Bothell, Washington with 500 employees, and several regional sales teams are located throughout their market base.
Adventure Works sells products wholesale to specialty shops and to individuals through the Internet. For the data mining exercises, you will work with the AdventureWorksDW Internet sales tables, which contain realistic patterns that work well for data mining exercises.
For more information on Adventure Works Cycles, see "Sample Databases and Business Scenarios" in SQL Server Books Online.
Database Details
The Internet sales schema contains information about 9,242 customers. These customers live in six countries, which are combined into three regions:
∙ North America (83%)
∙ Europe (12%)
∙ Australia (7%)
The database contains data for three fiscal years: 2002, 2003, and 2004.
The products in the database are broken down by subcategory, model, and product.
Business Intelligence Development Studio
Business Intelligence Development Studio is a set of tools designed for creating business intelligence projects. Because Business Intelligence Development Studio was created as an IDE environment in which you can create a complete solution, you work disconnected from the server. You can change your data mining objects as much as you want, but the changes are not reflected on the server until after you deploy the project.
Working in an IDE is beneficial for the following reasons:
∙ You have powerful customization tools available to configure Business Intelligence Development Studio to suit your needs.
∙ You can integrate your Analysis Services project with a variety of other business intelligence projects, encapsulating your entire solution into a single view.
∙ Full source control integration enables your entire team to collaborate in creating a complete business intelligence solution.
The Analysis Services project is the entry point for a business intelligence solution. An Analysis Services project encapsulates mining models and OLAP cubes, along with supplemental objects that make up the Analysis Services database. From Business Intelligence Development Studio, you can create and edit Analysis Services objects within a project and deploy the project to the appropriate Analysis Services server or servers.
If you are working with an existing Analysis Services project, you can also use Business Intelligence Development Studio to work connected to the server. In this way, changes are reflected directly on the server without having to deploy the solution.
SQL Server Management Studio
SQL Server Management Studio is a collection of administrative and scripting tools for working with Microsoft SQL Server components. This workspace differs from Business Intelligence Development Studio in that you are working in a connected environment where actions are prop