Distributed Naive Bayes for Data Classification
-*****University-
Cloud Computing
Name:
****
Sno:
****
Major:
***
Personal information: *************
Machine learning algorithms can take advantage of the powerful Hadoop distributed computing platform and the MapReduce programming model to process data in parallel. Many machine learning algorithms have been transformed to the MapReduce paradigm in order to make use of the Hadoop Distributed File System (HDFS). The Naïve Bayes classifier is a supervised classification algorithm that can be programmed in the form of MapReduce. In this work, we build a Naïve Bayes MapReduce model and evaluate the classifier on a dataset from UCI based on its prediction accuracy.
1. Hadoop overview
Due to the challenges brought by volume, velocity and variety, a new technology is required for Big Data. Apache Hadoop currently plays a leading role in the Big Data field and was the first viable platform for Big Data analytics. Hadoop is an open-source software framework for scalable, reliable, distributed computing that is capable of handling Big Data's "three Vs" challenges. Originally inspired by Google's MapReduce and the Google File System (GFS), Hadoop uses a simple programming model to process large-scale datasets across clusters of machines and to distribute their storage. Since data processing runs on a cluster of machines, it is necessary to deal with node failures, which are likely to occur during processing. Instead of relying on highly expensive servers with high fault tolerance, Hadoop handles node failure itself: it detects failed nodes in the cluster and re-distributes the data to other available machines. In addition, Hadoop sets up a scheme to protect itself from losing the metadata of the distributed environment. Hadoop has therefore become widely adopted by many organizations because of its reliability and scalability in processing vast quantities of data at an affordable distributed-computing infrastructure cost.
Hadoop consists of two important elements. The first is a high-performance distributed data-processing framework called MapReduce, which is also the framework used in this work. Hadoop breaks the datasets down into multiple partitions and distributes their storage over the cluster. MapReduce performs data processing on each server against the blocks of data residing on that machine, which saves a great amount of time due to parallel processing. This emits intermediate summaries, which are aggregated and resolved to the final result in a reduce stage. Specifically, the MapReduce paradigm consists of two major steps: the map step and the reduce step (as shown in Figure 1). The map step converts each input partition of the data into key/value pairs and operates in parallel across the cluster, while the reduce step collects the data, performs some computation and resolves the values into a single result.

The second element of Hadoop is the Hadoop Distributed File System (HDFS), which permits high-bandwidth computation and distributed low-cost storage, both essential for Big Data tasks.
Figure 1. Illustration of the MapReduce framework
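The map and reduce steps described above can be sketched in plain Python. This is a minimal single-process illustration of the paradigm (word counting as the example job), not Hadoop itself; the explicit sort-then-group stands in for Hadoop's shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def map_step(record):
    # Emit intermediate (key, value) pairs; here: count words in one input line.
    for word in record.split():
        yield word, 1

def reduce_step(key, values):
    # Resolve all values for one key into a single result.
    return key, sum(values)

def run_mapreduce(records):
    # Map phase: apply map_step to every record.
    intermediate = [pair for rec in records for pair in map_step(rec)]
    # Shuffle phase: group intermediate pairs by key, as Hadoop does between stages.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_step call per distinct key.
    return dict(reduce_step(k, (v for _, v in group))
                for k, group in groupby(intermediate, key=itemgetter(0)))
```

In Hadoop the shuffle and the per-key reduce calls run across many machines; the control flow above only mirrors the data movement on one process.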
2. Naïve Bayes classification algorithm
The Naïve Bayes algorithm is one of the most important supervised machine learning algorithms for classification. It is a simple probabilistic classifier based on applying Bayes' theorem. Naïve Bayes classification assumes that the attribute probabilities P(x_i | c) are independent given the class c, where x_i is the i-th attribute of the data instance. This assumption reduces the complexity of the problem to a practical level so that it can be solved easily. Despite this simplification, the Naïve Bayes classifier still gives a high degree of accuracy.
2.1. Formulation
For a data instance d and a class c, the Bayes theorem gives:

P(c | d) = P(d | c) P(c) / P(d)

where P(c | d) is the probability of a class given a data instance, P(d | c) is the probability of a data instance given the class, P(c) is the probability of the class, and P(d) is the probability of the data instance. P(c | d) is the probability we use to choose the class; specifically, we are looking for the class that maximizes P(c | d) out of all classes for a given data instance, as in the following equation:

c_MAP = argmax_c P(c | d) = argmax_c P(d | c) P(c)

where c_MAP is the class that has the maximum a posteriori (MAP) probability. Notably, the probability P(d) of the data instance is a constant with respect to the class, so it is dropped from the equation above. We call P(d | c) the likelihood, which is the probability of the data instance given the class, and call P(c) the prior, which is the probability of the class. We can represent P(d | c) as the probability of a vector of attributes conditioned on the class, as follows:

P(d | c) = P(x_1, x_2, ..., x_n | c)

With the assumption that the attribute probabilities P(x_i | c) are independent given the class c, the probability of the set of attributes given the class becomes the product of the individual attribute probabilities:

P(x_1, x_2, ..., x_n | c) = ∏_i P(x_i | c)

Hence the best class chosen by the Naïve Bayes classifier is:

c_NB = argmax_c P(c) ∏_i P(x_i | c)
2.2. Parameter estimation
Upon obtaining the mathematical model of the Naïve Bayes classifier, the next step is to estimate the parameters of the model: the prior of each class and the distributions of the attributes. The prior P(c) of a class can be estimated simply by its relative frequency in our training sample:

P̂(c) = N_c / N

where N_c is the number of training samples of class c and N is the total number of training samples. With regard to the estimation of attribute distributions, there are three models that are widely used in applications: Gaussian Naïve Bayes, Multinomial Naïve Bayes and Bernoulli Naïve Bayes. Gaussian Naïve Bayes is mostly used when dealing with continuous data, while the other two models are well suited for discrete data. For multinomial Naïve Bayes, the probability that the i-th attribute has value a conditioned on a given class c can be estimated from the training dataset:

P̂(x_i = a | c) = count(x_i = a, c) / N_c
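The two counting estimates above, the class prior as a relative frequency and the per-attribute value probability from class-conditional counts, could be computed as in the sketch below. The Laplace smoothing term `alpha` is an assumption added to avoid zero probabilities for unseen values; the text itself does not specify smoothing:

```python
from collections import Counter, defaultdict

def estimate_parameters(samples, labels, alpha=1.0):
    """samples: list of attribute tuples; labels: parallel list of class labels."""
    n = len(labels)
    class_counts = Counter(labels)
    # Prior: P(c) = N_c / N
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    # Class-conditional counts: count(x_i = a, c), keyed by (class, attribute index).
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for i, a in enumerate(x):
            value_counts[(c, i)][a] += 1

    def likelihood(c, i, a, num_values):
        # P(x_i = a | c) with Laplace smoothing over num_values possible values.
        return ((value_counts[(c, i)][a] + alpha)
                / (class_counts[c] + alpha * num_values))

    return priors, likelihood
```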
2.3. Measurement of the Naïve Bayes classifier
In this work, we use accuracy to measure our method. Accuracy equals the true positives plus the true negatives, divided by the total over all four categories (true positives, true negatives, false positives and false negatives). In many applications, accuracy is a useful measure for the Naïve Bayes classifier. But there is a particular scenario, when dealing with things that are uncommon, in which accuracy is not a useful measure. For example, suppose 99.99% of the data are from category A, while only 0.01% are from the counterpart category B. Then we obtain 99.99% accuracy even when our classifier assigns all data to category A and never detects category B, which is apparently undesirable. For this situation, precision and recall are used as the measurement of our classifier.
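These three measures can be made concrete in a few lines (a sketch on hypothetical labels, not the paper's data; the rare class is treated as the positive class):

```python
def evaluate(predicted, actual, positive):
    """Return (accuracy, precision, recall) treating `positive` as the rare class."""
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    tn = sum(p != positive and a != positive for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(p != positive == a for p, a in zip(predicted, actual))
    accuracy = (tp + tn) / len(actual)
    # Guard against division by zero when the positive class is never predicted/seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

A classifier that never predicts the rare class can still score high accuracy, but its recall on that class is zero, which is exactly the failure mode described above.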
3. Experiments and Results
3.1. Experimental environments
Hadoop runs on a Linux or Unix operating system, whereas my operating system is Windows. A virtual machine allows me to run a Linux operating system such as Ubuntu; I use VirtualBox here, but VMware or another virtualization product would also work. Figure 2 shows the resulting setup. The physical CPU is an Intel(R) Core(TM) i3-2330M at 2.20 GHz and the virtual machine's memory is 1024 MB. Table 1 describes our experimental environment.
Table 1. Experimental environments

Operating System: Ubuntu 12.04 (1 GB memory, 2.20 GHz)
JRE: JDK 1.6
Hadoop Version: Hadoop-0.20.2
Eclipse Version: Eclipse-SDK-3.3.1
Figure 2. Experimental environment of this work
3.2. Experiment datasets
In this work, we use the dataset from UCI provided by the teacher to test the accuracy and speed of our method. The original data is divided into a training set and a testing set. Figure 3 gives a simple description of the first dataset, which is used to test the accuracy.
Figure 3. The first dataset used to test the accuracy
In the original data, "1.txt" contains 5,000,000 training samples with 102 columns each: the first column is the ID, the 2nd to the 101st columns are the attributes, and the last column is the class label. "2.txt" contains 500,000 samples to be classified, each with 101 columns, matching the structure of the first 101 columns of "1.txt".
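Under the layout just described (ID, 100 attributes, then the class label), a record of "1.txt" might be parsed like this. The whitespace field separator is an assumption, as the text does not state the delimiter:

```python
def parse_training_line(line):
    """Split one "1.txt" record into (id, attributes, class label).

    Columns: 1 = ID, 2..101 = attributes, 102 = class label.
    Assumes whitespace-separated fields (delimiter not stated in the text).
    """
    fields = line.split()
    record_id, attributes, label = fields[0], fields[1:-1], fields[-1]
    return record_id, attributes, label
```

Records of "2.txt" would be parsed the same way but without the trailing label column.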
3.3. Experimental results
The accuracy rate for a partial run is shown in Figure 4.
Figure 4. Partial result for accuracy
We can see that the accuracy rate is not very high, and that there is no bias toward any particular training set. As the error rate of the third run is slightly lower than that of the other runs, we use the model estimated in this run to predict the classification of the "unknown" data. The comparison of class labels is shown in Figure 5.
Figure 5. Comparison result of class labels
In addition, the classification time is shown in Figure 6.

Figure 6. The classification time
4. Conclusions
In this work, we built a Naïve Bayes classifier by leveraging the MapReduce framework and also performed a scalability analysis to see the relationship between the running time and the size of the cluster. It turns out that, without reducing accuracy, a distributed Naïve Bayes classifier has much higher performance than running the algorithm on a single machine. A MapReduce version of the Naïve Bayes classifier turns out to be extremely efficient when dealing with large amounts of data.

Our work in this paper is only a small step towards leveraging machine learning algorithms with the MapReduce model. In the future, one direction is to experiment with more machine learning algorithms using MapReduce and to use the Mahout API to benchmark our experiments. Another option is to understand more of the internal system design of the Hadoop framework so as to better utilize our cluster resources for the jobs we create, for example by investigating the optimal number of mappers and reducers for a job to maximize the throughput of our large-scale data processing jobs.
Appendix