Hadoop FAQ
5. How can I help to make Hadoop better?
If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the mailing lists), pass that knowledge on to others by adding something to this wiki.
If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.
6. HDFS. If I add new data-nodes to the cluster, will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?
No, HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes.
There are several ways to rebalance the cluster manually.
1. Select a subset of files that take up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.
2. A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.
3. Yet another way to rebalance blocks is to turn off the data-node that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so you really get them rebalanced, not just removed from the current node.
4. Finally, you can use the bin/start-balancer.sh command to run a balancing process that moves blocks around the cluster automatically. See:
- HDFS User Guide: Rebalancer;
- HDFS Tutorial: Rebalancing;
- HDFS Commands Guide: balancer.
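Option 2 above can also be driven programmatically through the FileSystem API. A hedged sketch (requires a running cluster and hadoop-core on the classpath; the path and replication factors are made-up examples):

```java
// Sketch of rebalancing option 2: temporarily raise a file's replication,
// let HDFS spread the extra replicas onto the new nodes, then lower it again.
FileSystem fs = FileSystem.get(new Configuration());
Path big = new Path("/data/big-file");   // made-up example path
fs.setReplication(big, (short) 5);       // create extra replicas on new nodes
// ... wait for replication to stabilize (e.g. check with hadoop fsck) ...
fs.setReplication(big, (short) 3);       // drop back; excess replicas are removed
```

The same effect can be had from the command line with hadoop dfs -setrep.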
7. HDFS. What is the purpose of the secondary name-node?
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of failure.
The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads the current name-node image and edits log files, joins them into a new image, and uploads the new image back to the (primary and the only) name-node. See User Guide.
So if the name-node fails and you can restart it on the same physical node, then there is no need to shut down the data-nodes; just the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is, the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
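The checkpoint interval and the local directory where the secondary name-node keeps its merged image are configurable. A sketch of the relevant hadoop-site.xml entries, assuming the 0.x-era property names; the values shown are illustrative, not recommendations:

```xml
<!-- Illustrative values; fs.checkpoint.period is in seconds. -->
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>How often the secondary name-node creates a new
  checkpoint of the namespace.</description>
</property>
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary</value>
  <description>Local directory where the secondary name-node stores
  the downloaded image and edits before merging them.</description>
</property>
```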
8. MR. What is the DistributedCache used for?
The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
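Under the old (org.apache.hadoop.mapred) API, cache files are registered on the JobConf at submission time. A hedged sketch; the job class and path below are made-up examples:

```java
// Sketch: register a read-only lookup file with the DistributedCache.
// Requires hadoop-core on the classpath; MyJob and the path are made up.
JobConf conf = new JobConf(MyJob.class);
DistributedCache.addCacheFile(new URI("hdfs://namenode:9000/data/lookup.dat"), conf);
// Inside a task, the framework exposes the locally cached copies:
// Path[] local = DistributedCache.getLocalCacheFiles(conf);
```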
9. MR. Can I create/write-to HDFS files directly from my map/reduce tasks?
Yes. (Clearly, you want this since you need to create/write-to files other than the output file written out by OutputCollector.)
Caveats:
${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).
${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).
With speculative-execution on, one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on HDFS. Hence the app-writer will have to pick unique names (e.g. using the complete task id, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)
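The relation between a TIP and its task-attempts is just a trailing attempt number on the id string, which is why attempt-level names are unique while TIP-level names collide under speculative execution. A plain-Java illustration (the `tipOf` helper is made up for this example, not a Hadoop API):

```java
public class TaskIdDemo {
    // A task-attempt id is its TIP id plus a trailing "_<attempt number>".
    // Stripping the final component recovers the TIP id (hypothetical helper).
    static String tipOf(String attemptId) {
        return attemptId.substring(0, attemptId.lastIndexOf('_'));
    }

    public static void main(String[] args) {
        String attempt0 = "task_200709221812_0001_m_000000_0";
        String attempt1 = "task_200709221812_0001_m_000000_1";
        // Two speculative attempts share one TIP id...
        System.out.println(tipOf(attempt0).equals(tipOf(attempt1))); // prints "true"
        // ...so only names built from the full attempt id never collide:
        System.out.println("side-file_" + attempt0);
    }
}
```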
To get around this, the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each task-attempt on HDFS where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful taskid only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.
The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of the reduce-task, and the framework will move them out similarly - thus you don't have to pick unique paths per task-attempt.
Fine-print: the value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.
The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since the output of the map, in that case, goes directly to HDFS.
10. MR. How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files?
Essentially a job's input is represented by the InputFormat (interface) / FileInputFormat (base class).
For this purpose one would need a 'non-splittable' FileInputFormat, i.e. an input-format which essentially tells the map-reduce framework that the input cannot be split up and processed. To do this you need your particular input-format to return false for the isSplitable call.
E.g. org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat in src/test/org/apache/hadoop/mapred/SortValidator.java
In addition to implementing the InputFormat interface and having isSplitable(...) return false, it is also necessary to implement the RecordReader interface to return the whole content of the input file. (The default is LineRecordReader, which splits the file into separate lines.)
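A minimal sketch of such an input format under the old (org.apache.hadoop.mapred) API; the class name is made up, and hadoop-core must be on the classpath:

```java
// Hedged sketch: a text input format that refuses to split its files,
// so each map task consumes one whole input file.
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false; // never split: one file == one map task
    }
}
```

As the text notes, to hand the map a single record containing the whole file (rather than one record per line) you would also supply your own RecordReader.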
The other, quick-fix option is to set mapred.min.split.size to a large enough value.
11. Why do I see broken images in the jobdetails.jsp page?
In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files, embedded in HTML content. The graphics were tested successfully in Firefox 2 on Ubuntu and Mac OS. However, for other browsers, one should install an additional plugin in the browser to see the SVG images. Adobe's SVG Viewer can be found at
12. HDFS. Does the name-node stay in safe mode till all under-replicated files are fully replicated?
No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of the data-nodes report their blocks.
Depending on how safe mode parameters are configured, the name-node will stay in safe mode until a specific percentage of blocks of the system is minimally replicated (dfs.replication.min). If the safe mode threshold dfs.safemode.threshold.pct is set to 1, then all blocks of all files should be minimally replicated.
Minimal replication does not mean full replication. Some replicas may be missing, and in order to replicate them the name-node needs to leave safe mode.
Learn more about safe mode here.
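The two parameters mentioned above live in the HDFS configuration. A sketch of the hadoop-site.xml entries; the values shown are illustrative (to the best of my knowledge, the era defaults were 1 and 0.999f respectively):

```xml
<!-- Illustrative entries for the safe mode discussion above. -->
<property>
  <name>dfs.replication.min</name>
  <value>1</value>
  <description>A block is "minimally replicated" once it has
  this many replicas.</description>
</property>
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999f</value>
  <description>Fraction of blocks that must be minimally replicated
  before the name-node leaves safe mode; 1 means every block.</description>
</property>
```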
13. MR. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that?
Use the configuration knobs mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to control the number of maps/reduces spawned simultaneously on a TaskTracker. By default, each is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker.
You can set those on a per-TaskTracker basis to accurately reflect your hardware (i.e. set them to higher values on a beefier TaskTracker, etc.).
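For example, in the hadoop-site.xml of a better-provisioned TaskTracker (the values here are illustrative, not recommendations):

```xml
<!-- Illustrative values for a node with more cores and RAM. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```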
14. MR. Submitting map/reduce jobs as a different user doesn't work.
The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single-node systems, but not for "real" clusters. I like to use:

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

Note that this directory is in your default file system and must be accessible from both the client and server machines; it is typically in HDFS.
15. HDFS. How do I set up a Hadoop node to use multiple volumes?
Data-nodes can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories, one needs to specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir. Data-nodes will attempt to place equal amounts of data in each of the directories.
The name-node also supports multiple directories, which in this case store the namespace image and the edits log. The directories are specified via the dfs.name.dir configuration parameter. The name-node directories are used for namespace data replication, so that the image and the log can be restored from the remaining volumes if one of them fails.
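As a sketch, the corresponding hadoop-site.xml entries might look like this (the mount points are made-up examples):

```xml
<!-- Made-up mount points; one directory per physical disk. -->
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/hdfs/data,/disk2/hdfs/data,/disk3/hdfs/data</value>
</property>
<!-- The name-node keeps a full copy of the image and edits log in each. -->
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hdfs/name,/remote/nfs/hdfs/name</value>
</property>
```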
16. HDFS. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?
Starting with release hadoop-0.15, a file will appear in the namespace as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will g