Hadoop FAQ
5. How can I help to make Hadoop better?
If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the mailing lists), pass that knowledge on to others by adding something to this wiki.
If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.
6. HDFS. If I add new data-nodes to the cluster, will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?
No, HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes.
There are several ways to rebalance the cluster manually.
1. Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
2. A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
3. Yet another way to rebalance blocks is to turn off the data-node that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so you really get them rebalanced, not just removed from the current node.
4. Finally, you can use the bin/start-balancer.sh command to run a balancing process that moves blocks around the cluster automatically. See:
- HDFS User Guide: Rebalancer;
- HDFS Tutorial: Rebalancing;
- HDFS Commands Guide: balancer.
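Option 2 above can also be driven programmatically through the FileSystem API. A hedged sketch (requires a running cluster and hadoop-core on the classpath; the path and replication factors are made-up examples):

```java
// Sketch of rebalancing option 2: temporarily raise a file's replication,
// let HDFS spread the extra replicas onto the new nodes, then lower it again.
FileSystem fs = FileSystem.get(new Configuration());
Path big = new Path("/data/big-file");   // made-up example path
fs.setReplication(big, (short) 5);       // create extra replicas on new nodes
// ... wait for replication to stabilize (e.g. check with hadoop fsck) ...
fs.setReplication(big, (short) 3);       // drop back; excess replicas are removed
```

The same effect can be had from the command line with hadoop dfs -setrep.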
7. HDFS. What is the purpose of the secondary name-node?
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of failure.
The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads the current name-node image and edits log files, joins them into a new image, and uploads the new image back to the (primary and the only) name-node. See User Guide.
So if the name-node fails and you can restart it on the same physical node, then there is no need to shut down the data-nodes; just the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is, the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
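The checkpoint interval and the local directory where the secondary name-node keeps its merged image are configurable. A sketch of the relevant hadoop-site.xml entries, assuming the 0.x-era property names; the values shown are illustrative, not recommendations:

```xml
<!-- Illustrative values; fs.checkpoint.period is in seconds. -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>How often the secondary name-node creates a new
  checkpoint of the namespace.</description>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary</value>
  <description>Local directory where the secondary name-node stores
  the downloaded image and edits before merging them.</description>
</property>
```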
8. MR. What is the DistributedCache used for?
The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
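Under the old (org.apache.hadoop.mapred) API, cache files are registered on the JobConf at submission time. A hedged sketch; the job class and path below are made-up examples:

```java
// Sketch: register a read-only lookup file with the DistributedCache.
// Requires hadoop-core on the classpath; MyJob and the path are made up.
JobConf conf = new JobConf(MyJob.class);
DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/data/lookup.dat"), conf);
// Inside a task, the framework exposes the locally cached copies:
// Path[] local = DistributedCache.getLocalCacheFiles(conf);
```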
9. MR. Can I create/write-to HDFS files directly from my map/reduce tasks?
Yes. (Clearly, you want this since you need to create/write-to files other than the output file written out by OutputCollector.)
Caveats:
${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).
${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
With speculative-execution on, one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on HDFS. Hence the app-writer will have to pick unique names (e.g. using the complete task id, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
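The relation between a TIP and its task-attempts is just a trailing attempt number on the id string, which is why attempt-level names are unique while TIP-level names collide under speculative execution. A plain-Java illustration (the `tipOf` helper is made up for this example, not a Hadoop API):

```java
public class TaskIdDemo {
    // A task-attempt id is its TIP id plus a trailing "_<attempt number>".
    // Stripping the final component recovers the TIP id (hypothetical helper).
    static String tipOf(String attemptId) {
        return attemptId.substring(0, attemptId.lastIndexOf('_'));
    }

    public static void main(String[] args) {
        String attempt0 = "task_200709221812_0001_m_000000_0";
        String attempt1 = "task_200709221812_0001_m_000000_1";
        // Two speculative attempts share one TIP id...
        System.out.println(tipOf(attempt0).equals(tipOf(attempt1))); // prints "true"
        // ...so only names built from the full attempt id never collide:
        System.out.println("side-file_" + attempt0);
    }
}
```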
To get around this, the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each task-attempt on HDFS where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.
10. MR. How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files?
Essentially a job's input is represented by the InputFormat (interface) / FileInputFormat (base class).
For this purpose one would need a 'non-splittable' FileInputFormat, i.e. an input-format which essentially tells the map-reduce framework that the input cannot be split up and processed. To do this you need your particular input-format to return false for the isSplitable call.
E.g. org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat in src/test/org/apache/hadoop/mapred/SortValidator.java
In addition to implementing the InputFormat interface and having isSplitable(...) return false, it is also necessary to implement the RecordReader interface to return the whole content of the input file. (The default is LineRecordReader, which splits the file into separate lines.)
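A minimal sketch of such an input format under the old (org.apache.hadoop.mapred) API; the class name is made up, and hadoop-core must be on the classpath:

```java
// Hedged sketch: a text input format that refuses to split its files,
// so each map task consumes one whole input file.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split: one file == one map task
    }
}
```

As the text notes, to hand the map a single record containing the whole file (rather than one record per line) you would also supply your own RecordReader.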
The other, quick-fix option is to set mapred.min.split.size to a large enough value.
11. Why do I see broken images in the jobdetails.jsp page?
In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files, embedded in HTML content. The graphics were tested successfully in Firefox 2 on Ubuntu and Mac OS. However, for other browsers, one should install an additional plugin in the browser to see the SVG images. Adobe's SVG Viewer can be found at
12. HDFS. Does the name-node stay in safe mode till all under-replicated files are fully replicated?
No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of the data-nodes report their blocks.
Depending on how safe mode parameters are configured, the name-node will stay in safe mode until a specific percentage of blocks of the system is minimally replicated (dfs.replication.min). If the safe mode threshold dfs.safemode.threshold.pct is set to 1, then all blocks of all files should be minimally replicated.
Minimal replication does not mean full replication. Some replicas may be missing, and in order to replicate them the name-node needs to leave safe mode.
Learn more about safe mode here.
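The two parameters mentioned above live in the HDFS configuration. A sketch of the hadoop-site.xml entries; the values shown are illustrative (to the best of my knowledge, the era defaults were 1 and 0.999f respectively):

```xml
<!-- Illustrative entries for the safe mode discussion above. -->
<property>
  <name>dfs.replication.min</name>
  <value>1</value>
  <description>A block is "minimally replicated" once it has
  this many replicas.</description>
</property>
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999f</value>
  <description>Fraction of blocks that must be minimally replicated
  before the name-node leaves safe mode; 1 means every block.</description>
</property>
```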
13. MR. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that?
Use the configuration knobs mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to control the number of maps/reduces spawned simultaneously on a TaskTracker. By default, each is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker.
You can set those on a per-TaskTracker basis to accurately reflect your hardware (i.e. set them to higher values on a beefier TaskTracker, etc.).
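For example, in the hadoop-site.xml of a better-provisioned TaskTracker (the values here are illustrative, not recommendations):

```xml
<!-- Illustrative values for a node with more cores and RAM. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```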
14. MR. Submitting map/reduce jobs as a different user doesn't work.
The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single-node systems, but not for "real" clusters. I like to use:

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

Note that this directory is in your default file system and must be accessible from both the client and server machines; it is typically in HDFS.
15. HDFS. How do I set up a Hadoop node to use multiple volumes?
Data-nodes can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir. Data-nodes will attempt to place equal amounts of data in each of the directories.
The name-node also supports multiple directories, which in this case store the namespace image and the edits log. The directories are specified via the dfs.name.dir configuration parameter. The name-node directories are used for namespace data replication, so that the image and the log can be restored from the remaining volumes if one of them fails.
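As a sketch, the corresponding hadoop-site.xml entries might look like this (the mount points are made-up examples):

```xml
<!-- Made-up mount points; one directory per physical disk. -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
<!-- The name-node keeps a full copy of the image and edits log in each. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/remote/nfs/hdfs/name</value>
</property>
```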
16. HDFS. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Starting with release hadoop-0.15, a file will appear in the namespace as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will g