
Distributed Naive Bayes for Data Classification

- ***** University -

 


Cloud Computing

 

Name:

****

Student No.:

****

Major:

***

 


Machine learning algorithms have the advantage of making use of the powerful Hadoop distributed computing platform and the MapReduce programming model to process data in parallel. Many machine learning algorithms have been investigated and transformed to the MapReduce paradigm in order to make use of the Hadoop Distributed File System (HDFS). The Naïve Bayes classifier is one of the supervised learning classification algorithms that can be programmed in the form of MapReduce. In this work, we build a Naïve Bayes MapReduce model and evaluate the classifier on a data set from UCI based on the prediction accuracy.

1. Hadoop overview

Due to the challenges brought up by volume, velocity and variety, a new technology is required for Big Data. Apache Hadoop is currently playing a leading role in the Big Data field, and it is the first viable platform for Big Data analytics. Hadoop is an open-source software framework for scalable, reliable, distributed computing that is capable of getting a handle on Big Data's "three Vs" challenges. Originally inspired by Google's MapReduce and the Google File System (GFS), Hadoop uses a simple programming model to process large-scale data sets across clusters of machines and to distribute their storage. Since the data processing runs on a cluster of machines, it is necessary to deal with node failures that are likely to occur during the course of processing. Instead of relying on highly expensive servers with high fault tolerance, Hadoop handles node failure itself through its services, which detect node failures in the cluster and redistribute the data to other available machines. In addition, Hadoop sets up a scheme to protect itself from losing the metadata of the distributed environment. Hadoop has therefore become widely employed by many organizations because of its reliability and scalability in processing vast quantities of data at an affordable cost of distributed computing infrastructure.

Hadoop consists of two important elements. The first is a high-performance distributed data processing framework called MapReduce, which is also the framework used in this work. Hadoop breaks the data sets down into multiple partitions and distributes their storage over the cluster. MapReduce performs data processing on each server against the blocks of data residing on that machine, which saves a great amount of time due to parallel processing. This emits intermediate summaries which are aggregated and resolved into the final result in a reduce stage. Specifically, the MapReduce paradigm consists of two major steps, a map step and a reduce step (as shown in Figure 1): the map step converts each input partition of the data into key/value pairs and operates in parallel across the cluster, and the reduce task collects the data, performs some computation and resolves it into a single value.
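The map, shuffle and reduce steps can be sketched in plain Python without a Hadoop cluster. This toy example counts class labels, following the same emit/group/aggregate pattern a real Hadoop job would use; the record format is hypothetical.

```python
from collections import defaultdict

def map_step(record):
    """Emit a (key, value) pair for one input record."""
    label = record.strip().split(",")[-1]  # hypothetical CSV: last field is the class
    return (label, 1)

def shuffle(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_step(key, values):
    """Resolve all values for one key into a single result."""
    return (key, sum(values))

records = ["1,0.5,A", "2,0.1,B", "3,0.9,A"]
pairs = [map_step(r) for r in records]
result = dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
# result == {"A": 2, "B": 1}
```

In a real Hadoop job the map calls run on the machines holding each data block, and the shuffle is performed by the framework over the network rather than in memory.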

The second element of Hadoop is the Hadoop Distributed File System (HDFS), which permits high-bandwidth computation and distributed low-cost storage, both essential for Big Data tasks.

Figure 1. Illustration of the MapReduce framework

2. Naïve Bayes classification algorithm

The Naïve Bayes algorithm is one of the most important supervised machine learning algorithms for classification. This classifier is a simple probabilistic classifier based on applying Bayes' theorem as follows:

P(c|d) = P(d|c) P(c) / P(d)

Naïve Bayes classification makes the assumption that the attribute probabilities P(x_i|c) are independent given the class c, where x_i is the i-th attribute of the data instance. This assumption reduces the complexity of the problem to a practical level so that it can be solved easily. Despite this simplification, the Naïve Bayes classifier still gives a high degree of accuracy.

2.1. Formulation

For a data instance d and a class c, based on Bayes' theorem we have:

P(c|d) = P(d|c) P(c) / P(d)

where P(c|d) is the probability of the class given the data instance, P(d|c) is the probability of the data instance given the class, P(c) is the probability of the class, and P(d) is the probability of the data instance. P(c|d) is the probability we use to choose the class; specifically, we are looking for the class that maximizes P(c|d) out of all classes for a given data instance, as shown in the following equation:

c_MAP = argmax_c P(c|d) = argmax_c P(d|c) P(c)

where c_MAP is the class that has the maximum a posteriori (MAP) probability. Notably, the probability of the data instance P(d) is a constant for all classes, which is why it is dropped from the equation above. We call P(d|c) the likelihood, which is the probability of the data instance given the class, and we call P(c) the prior, which is the probability of the class. We can represent P(d|c) as the probability of a vector of attributes conditioned on the class, as follows:

P(d|c) = P(x_1, x_2, ..., x_n | c)

With the assumption that the attribute probabilities P(x_i|c) are independent given the class c, the probability of a set of attributes given the class becomes the product of the individual independent probabilities:

P(x_1, x_2, ..., x_n | c) = P(x_1|c) × P(x_2|c) × ... × P(x_n|c)

Hence the best class chosen by the Naïve Bayes classifier will be:

c_NB = argmax_c P(c) ∏_i P(x_i|c)
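A minimal sketch of this decision rule in plain Python. The model parameters below are hypothetical, and log-probabilities are summed instead of multiplying raw probabilities, a standard safeguard against floating-point underflow with many attributes.

```python
import math

def predict(instance, priors, cond_prob):
    """Return the class maximizing P(c) * prod_i P(x_i | c),
    computed in log space to avoid underflow."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(instance):
            score += math.log(cond_prob[c][i][value])
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# hypothetical two-class model over two binary attributes
priors = {"A": 0.6, "B": 0.4}
cond_prob = {
    "A": [{0: 0.8, 1: 0.2}, {0: 0.7, 1: 0.3}],
    "B": [{0: 0.1, 1: 0.9}, {0: 0.2, 1: 0.8}],
}
print(predict([1, 1], priors, cond_prob))  # prints "B"
```

Because the logarithm is monotonic, the argmax over log P(c) + Σ log P(x_i|c) picks the same class as the product form above.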

2.2. Parameter estimation

Upon obtaining the mathematical model of the Naïve Bayes classifier, the next step is to estimate the parameters in the model: the prior of each class and the distributions of the attributes. The prior of the class c can be estimated by simply calculating the relative frequency of the class in our training sample:

P̂(c) = N_c / N

where N_c is the number of training instances with class c and N is the total number of training instances. With regard to the estimation of the attribute distributions, three models are widely used in applications: Gaussian Naïve Bayes, Multinomial Naïve Bayes and Bernoulli Naïve Bayes. Gaussian Naïve Bayes is mostly used when dealing with continuous data, while the other two models are well suited to discrete data. For Multinomial Naïve Bayes, the probability that the i-th attribute takes value a conditioned on a given class c can be estimated using the training data set:

P̂(x_i = a | c) = count(x_i = a, c) / N_c
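These estimates can be sketched in plain Python. The Laplace smoothing term `alpha` is a common safeguard against zero counts and is an addition not present in the formulas above; setting `alpha=0` recovers the plain relative-frequency estimates.

```python
from collections import Counter, defaultdict

def estimate_parameters(X, y, alpha=1.0):
    """Estimate the class priors P(c) and the conditional
    probabilities P(x_i = a | c) from discrete training data.
    alpha > 0 applies Laplace smoothing (an assumption added here)."""
    n = len(y)
    class_counts = Counter(y)
    priors = {c: cnt / n for c, cnt in class_counts.items()}

    # value_counts[c][i][a] = number of class-c instances whose i-th attribute equals a
    value_counts = defaultdict(lambda: defaultdict(Counter))
    values_per_attr = defaultdict(set)
    for xs, c in zip(X, y):
        for i, a in enumerate(xs):
            value_counts[c][i][a] += 1
            values_per_attr[i].add(a)

    cond_prob = {}
    for c, cnt in class_counts.items():
        cond_prob[c] = {}
        for i, vals in values_per_attr.items():
            denom = cnt + alpha * len(vals)
            cond_prob[c][i] = {a: (value_counts[c][i][a] + alpha) / denom for a in vals}
    return priors, cond_prob

# tiny illustrative training set: two binary attributes, two classes
X = [[0, 1], [0, 0], [1, 1], [1, 0]]
y = ["A", "A", "B", "B"]
priors, cond = estimate_parameters(X, y)
# priors == {"A": 0.5, "B": 0.5}
```

In the MapReduce setting, the counts in `class_counts` and `value_counts` are exactly the intermediate sums that the map and reduce steps would accumulate in parallel.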

2.3. Measurement of the Naïve Bayes classifier

In this work, we use accuracy to measure our method. Accuracy equals the number of true positives plus the number of true negatives, divided by the total of all four outcome counts (true positives, true negatives, false positives and false negatives). In many applications, accuracy is a useful measure for the Naïve Bayes classifier. But there is a particular scenario, when dealing with things that are uncommon, in which accuracy is not a useful measure. For example, suppose 99.99% of the data are from category A while only 0.01% are from the counterpart category B. We would then obtain 99.99% accuracy even if our classifier assigned all data to category A and never identified a single instance of category B, which is apparently undesired. For this situation, precision and recall are used as the measurements of our classifier.
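A short sketch of these measures; the counts below illustrate the rare-class scenario described above and are made up for the example.

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, precision and recall from the four outcome counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# the classifier labels everything as the majority class, so it never
# finds the single minority-class instance among 10,000
acc, prec, rec = metrics(tp=0, tn=9999, fp=0, fn=1)
print(acc)  # 0.9999 -- looks excellent
print(rec)  # 0.0 -- but recall exposes the failure
```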

3. Experiments and Results

3.1. Experimental environment

Hadoop runs on Linux or Unix operating systems, but my operating system is Windows. A virtual machine can provide a Linux operating system such as Ubuntu; I use VirtualBox here, although others may prefer VMware or another virtualization product. Figure 2 shows the environment achieved in the end. The physical CPU is an Intel(R) Core(TM) i3-2330M at 2.20 GHz and the virtual machine memory is 1024 MB. Table 1 gives a description of our experimental environment.

Table 1. Experimental environment

Operating System: Ubuntu 12.04 (1 GB memory, 2.20 GHz)
JRE: JDK 1.6
Hadoop Version: Hadoop-0.20.2
Eclipse Version: Eclipse-SDK-3.3.1

Figure 2. Experimental environment for this work

3.2. Experimental data sets

In this work, we use the data set from UCI provided by the teacher to test the accuracy and speed of our method. The original data is divided into a training set and a testing set. Figure 3 gives a simple description of the first data set, which is used to test the accuracy.

Figure 3. The first data set used to test the accuracy

In the original data, the file "1.txt" contains 5,000,000 training samples with 102 columns: the first column is the ID, the 2nd to the 101st columns are the attributes, and the last column is the class label. The file "2.txt" contains 500,000 samples to be classified; it has 101 columns with the same structure as the first 101 columns of "1.txt".
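A minimal sketch of parsing one line of the training file into its three parts. The whitespace delimiter is an assumption, since the text does not state the separator used in the files.

```python
def parse_training_line(line):
    """Split one line of the training file into (id, attributes, label),
    following the layout described in the text: column 1 is the ID,
    columns 2-101 are the attributes, column 102 is the class label.
    Whitespace separation is assumed."""
    fields = line.split()
    record_id = fields[0]
    attributes = fields[1:-1]
    label = fields[-1]
    return record_id, attributes, label

# synthetic line: ID 42, one hundred zero-valued attributes, class 1
rid, attrs, label = parse_training_line("42 " + " ".join(["0"] * 100) + " 1")
# rid == "42", len(attrs) == 100, label == "1"
```

In the MapReduce job, this parsing would happen inside the map step, once per input line.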

3.3. Experimental results

The accuracy rate for a partial run is shown in Figure 4.

Figure 4. Partial results for accuracy

We can see that the accuracy rate is not very high, and that there is no bias across the training sets. As the error rate of the third run is slightly lower than that of the other runs, we use the model estimated in this run to predict the class labels of the "unknown" data. The comparison of the class labels is shown in Figure 5.

Figure 5. Comparison of the class labels

In addition, the class counts are shown in Figure 6.

Figure 6. The class counts

4. Conclusions

In this work, we built a Naïve Bayes classifier by leveraging the MapReduce framework and also performed a scalability analysis to examine the relationship between the running time and the size of the cluster. It turns out that, without reducing the accuracy, a distributed Naïve Bayes classifier has much higher performance than running the algorithm on a single machine. A MapReduce version of the Naïve Bayes classifier turns out to be extremely efficient when dealing with large amounts of data.

Our work in this paper is only a small step towards leveraging machine learning algorithms with the MapReduce model. In the future, one direction could be to experiment with more machine learning algorithms using MapReduce and to use the Mahout API to benchmark our experiments. Another option could be to understand more of the internal system design of the Hadoop framework so as to better utilize our cluster resources for the jobs we create, for example by investigating the optimal number of mappers and reducers for a job to maximize the throughput of our large-scale data processing jobs.

