Distributed Naive Bayes for Data Classification
-*****University-
Cloud Computing
Name:
****
Sno:
****
Major:
***
Personal information: *************
Machine learning algorithms can take advantage of the powerful Hadoop distributed computing platform and the MapReduce programming model to process data in parallel. Many machine learning algorithms have been transformed to the MapReduce paradigm in order to make use of the Hadoop Distributed File System (HDFS). The Naïve Bayes classifier is a supervised classification algorithm that can be programmed in the form of MapReduce. In this work, we build a Naïve Bayes MapReduce model and evaluate the classifier on a dataset from UCI based on its prediction accuracy.
1. Hadoop overview
Due to the challenges brought by volume, velocity and variety, a new technology is required for Big Data. Apache Hadoop currently plays a leading role in the Big Data field and was the first viable platform for Big Data analytics. Hadoop is an open-source software framework for scalable, reliable, distributed computing that is capable of handling Big Data's "three Vs" challenges. Originally inspired by Google's MapReduce and the Google File System (GFS), Hadoop uses a simple programming model to process large-scale datasets across clusters of machines and to distribute their storage. Since data processing runs on a cluster of machines, it is necessary to deal with node failures, which are likely to occur during processing. Instead of relying on highly expensive servers with high fault tolerance, Hadoop handles node failure itself: it detects failed nodes in the cluster and re-distributes the data to other available machines. In addition, Hadoop sets up a scheme to protect itself from losing the metadata of the distributed environment. Hadoop has therefore become widely adopted by many organizations because of its reliability and scalability in processing vast quantities of data at an affordable distributed-computing infrastructure cost.
Hadoop consists of two important elements. The first is a high-performance distributed data-processing framework called MapReduce, which is also the framework used in this work. Hadoop breaks the datasets down into multiple partitions and distributes their storage over the cluster. MapReduce performs data processing on each server against the blocks of data residing on that machine, which saves a great amount of time due to parallel processing. This emits intermediate summaries, which are aggregated and resolved to the final result in a reduce stage. Specifically, the MapReduce paradigm consists of two major steps: the map step and the reduce step (as shown in Figure 1). The map step converts each input partition of the data into key/value pairs and operates in parallel across the cluster, while the reduce step collects the data, performs some computation and resolves the values into a single result.

The second element of Hadoop is the Hadoop Distributed File System (HDFS), which permits high-bandwidth computation and distributed low-cost storage, both essential for Big Data tasks.
Figure 1. Illustration of the MapReduce framework
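The map and reduce steps described above can be sketched in plain Python. This is a minimal single-process illustration of the paradigm (word counting as the example job), not Hadoop itself; the explicit sort-then-group stands in for Hadoop's shuffle phase:

```python
from itertools import groupby
from operator import itemgetter

def map_step(record):
    # Emit intermediate (key, value) pairs; here: count words in one input line.
    for word in record.split():
        yield word, 1

def reduce_step(key, values):
    # Resolve all values for one key into a single result.
    return key, sum(values)

def run_mapreduce(records):
    # Map phase: apply map_step to every record.
    intermediate = [pair for rec in records for pair in map_step(rec)]
    # Shuffle phase: group intermediate pairs by key, as Hadoop does between stages.
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reduce_step call per distinct key.
    return dict(reduce_step(k, (v for _, v in group))
                for k, group in groupby(intermediate, key=itemgetter(0)))
```

In Hadoop the shuffle and the per-key reduce calls run across many machines; the control flow above only mirrors the data movement on one process.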
2. Naïve Bayes classification algorithm
The Naïve Bayes algorithm is one of the most important supervised machine learning algorithms for classification. It is a simple probabilistic classifier based on applying Bayes' theorem. Naïve Bayes classification assumes that the attribute probabilities P(x_i | c) are independent given the class c, where x_i is the i-th attribute of the data instance. This assumption reduces the complexity of the problem to a practical level so that it can be solved easily. Despite this simplification, the Naïve Bayes classifier still gives a high degree of accuracy.
2.1. Formulation
For a data instance d and a class c, the Bayes theorem gives:

P(c | d) = P(d | c) P(c) / P(d)

where P(c | d) is the probability of a class given a data instance, P(d | c) is the probability of a data instance given the class, P(c) is the probability of the class, and P(d) is the probability of the data instance. P(c | d) is the probability we use to choose the class; specifically, we are looking for the class that maximizes P(c | d) out of all classes for a given data instance, as in the following equation:

c_MAP = argmax_c P(c | d) = argmax_c P(d | c) P(c)

where c_MAP is the class that has the maximum a posteriori (MAP) probability. Notably, the probability P(d) of the data instance is a constant with respect to the class, so it is dropped from the equation above. We call P(d | c) the likelihood, which is the probability of the data instance given the class, and call P(c) the prior, which is the probability of the class. We can represent P(d | c) as the probability of a vector of attributes conditioned on the class, as follows:

P(d | c) = P(x_1, x_2, ..., x_n | c)

With the assumption that the attribute probabilities P(x_i | c) are independent given the class c, the probability of the set of attributes given the class becomes the product of the individual attribute probabilities:

P(x_1, x_2, ..., x_n | c) = ∏_i P(x_i | c)

Hence the best class chosen by the Naïve Bayes classifier is:

c_NB = argmax_c P(c) ∏_i P(x_i | c)
2.2. Parameter estimation
Upon obtaining the mathematical model of the Naïve Bayes classifier, the next step is to estimate the parameters of the model: the prior of each class and the distributions of the attributes. The prior P(c) of a class can be estimated simply by its relative frequency in our training sample:

P̂(c) = N_c / N

where N_c is the number of training samples of class c and N is the total number of training samples. With regard to the estimation of attribute distributions, there are three models that are widely used in applications: Gaussian Naïve Bayes, Multinomial Naïve Bayes and Bernoulli Naïve Bayes. Gaussian Naïve Bayes is mostly used when dealing with continuous data, while the other two models are well suited for discrete data. For multinomial Naïve Bayes, the probability that the i-th attribute has value a conditioned on a given class c can be estimated from the training dataset:

P̂(x_i = a | c) = count(x_i = a, c) / N_c
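The two counting estimates above, the class prior as a relative frequency and the per-attribute value probability from class-conditional counts, could be computed as in the sketch below. The Laplace smoothing term `alpha` is an assumption added to avoid zero probabilities for unseen values; the text itself does not specify smoothing:

```python
from collections import Counter, defaultdict

def estimate_parameters(samples, labels, alpha=1.0):
    """samples: list of attribute tuples; labels: parallel list of class labels."""
    n = len(labels)
    class_counts = Counter(labels)
    # Prior: P(c) = N_c / N
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    # Class-conditional counts: count(x_i = a, c), keyed by (class, attribute index).
    value_counts = defaultdict(Counter)
    for x, c in zip(samples, labels):
        for i, a in enumerate(x):
            value_counts[(c, i)][a] += 1

    def likelihood(c, i, a, num_values):
        # P(x_i = a | c) with Laplace smoothing over num_values possible values.
        return ((value_counts[(c, i)][a] + alpha)
                / (class_counts[c] + alpha * num_values))

    return priors, likelihood
```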
2.3. Measurement of the Naïve Bayes classifier
In this work, we use accuracy to measure our method. Accuracy equals the true positives plus the true negatives, divided by the total over all four categories (true positives, true negatives, false positives and false negatives). In many applications, accuracy is a useful measure for the Naïve Bayes classifier. But there is a particular scenario, when dealing with things that are uncommon, in which accuracy is not a useful measure. For example, suppose 99.99% of the data are from category A, while only 0.01% are from the counterpart category B. Then we obtain 99.99% accuracy even when our classifier assigns all data to category A and never detects category B, which is apparently undesirable. For this situation, precision and recall are used as the measurement of our classifier.
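These three measures can be made concrete in a few lines (a sketch on hypothetical labels, not the paper's data; the rare class is treated as the positive class):

```python
def evaluate(predicted, actual, positive):
    """Return (accuracy, precision, recall) treating `positive` as the rare class."""
    tp = sum(p == positive == a for p, a in zip(predicted, actual))
    tn = sum(p != positive and a != positive for p, a in zip(predicted, actual))
    fp = sum(p == positive != a for p, a in zip(predicted, actual))
    fn = sum(p != positive == a for p, a in zip(predicted, actual))
    accuracy = (tp + tn) / len(actual)
    # Guard against division by zero when the positive class is never predicted/seen.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall
```

A classifier that never predicts the rare class can still score high accuracy, but its recall on that class is zero, which is exactly the failure mode described above.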
3. Experiments and Results
3.1. Experimental environments
Hadoop runs on a Linux or Unix operating system, whereas my operating system is Windows. A virtual machine allows me to run a Linux operating system such as Ubuntu; I use VirtualBox here, but VMware or another virtualization product would also work. Figure 2 shows the resulting setup. The physical CPU is an Intel(R) Core(TM) i3-2330M at 2.20 GHz and the virtual machine's memory is 1024 MB. Table 1 describes our experimental environment.
Table 1. Experimental environments

Operating System: Ubuntu 12.04 (1 GB memory, 2.20 GHz)
JRE: JDK 1.6
Hadoop Version: Hadoop-0.20.2
Eclipse Version: Eclipse-SDK-3.3.1
Figure 2. Experimental environment of this work
3.2. Experiment datasets
In this work, we use the dataset from UCI provided by the teacher to test the accuracy and speed of our method. The original data is divided into a training set and a testing set. Figure 3 gives a simple description of the first dataset, which is used to test the accuracy.
Figure 3. The first dataset used to test the accuracy
In the original data, "1.txt" contains 5,000,000 training samples with 102 columns each: the first column is the ID, the 2nd to the 101st columns are the attributes, and the last column is the class label. "2.txt" contains 500,000 samples to be classified, each with 101 columns, matching the structure of the first 101 columns of "1.txt".
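Under the layout just described (ID, 100 attributes, then the class label), a record of "1.txt" might be parsed like this. The whitespace field separator is an assumption, as the text does not state the delimiter:

```python
def parse_training_line(line):
    """Split one "1.txt" record into (id, attributes, class label).

    Columns: 1 = ID, 2..101 = attributes, 102 = class label.
    Assumes whitespace-separated fields (delimiter not stated in the text).
    """
    fields = line.split()
    record_id, attributes, label = fields[0], fields[1:-1], fields[-1]
    return record_id, attributes, label
```

Records of "2.txt" would be parsed the same way but without the trailing label column.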
3.3. Experimental results
The accuracy rate for a partial run is shown in Figure 4.
Figure 4. Partial result for accuracy
We can see that the accuracy rate is not very high, and that there is no bias toward any particular training set. As the error rate of the third run is slightly lower than that of the other runs, we use the model estimated in this run to predict the classification of the "unknown" data. The comparison of class labels is shown in Figure 5.
Figure 5. Comparison result of class labels
In addition, the classification time is shown in Figure 6.

Figure 6. The classification time
4. Conclusions
In this work, we built a Naïve Bayes classifier by leveraging the MapReduce framework and also performed a scalability analysis to see the relationship between the running time and the size of the cluster. It turns out that, without reducing accuracy, a distributed Naïve Bayes classifier has much higher performance than running the algorithm on a single machine. A MapReduce version of the Naïve Bayes classifier turns out to be extremely efficient when dealing with large amounts of data.

Our work in this paper is only a small step towards leveraging machine learning algorithms with the MapReduce model. In the future, one direction is to experiment with more machine learning algorithms using MapReduce and to use the Mahout API to benchmark our experiments. Another option is to understand more of the internal system design of the Hadoop framework so as to better utilize our cluster resources for the jobs we create, for example by investigating the optimal number of mappers and reducers for a job to maximize the throughput of our large-scale data processing jobs.
Appendix