Hadoop FAQ


5. How can I help to make Hadoop better?

If you have trouble figuring out how to use Hadoop, then, once you've figured something out (perhaps with the help of the mailing lists), pass that knowledge on to others by adding something to this wiki.

If you find something that you wish were done better, and know how to fix it, read HowToContribute, and contribute a patch.

6. HDFS. If I add new data-nodes to the cluster, will HDFS move the blocks to the newly added nodes in order to balance disk space utilization between the nodes?

No, HDFS will not move blocks to new nodes automatically. However, newly created files will likely have their blocks placed on the new nodes.

There are several ways to rebalance the cluster manually.

1. Select a subset of files that takes up a good percentage of your disk space; copy them to new locations in HDFS; remove the old copies of the files; rename the new copies to their original names.

2. A simpler way, with no interruption of service, is to turn up the replication of files, wait for transfers to stabilize, and then turn the replication back down.

3. Yet another way to rebalance blocks is to turn off the data-node that is full, wait until its blocks are replicated, and then bring it back again. The over-replicated blocks will be randomly removed from different nodes, so you really get them rebalanced, not just removed from the current node.

4. Finally, you can use the bin/start-balancer.sh command to run a balancing process to move blocks around the cluster automatically. See

o HDFS User Guide: Rebalancer;

o HDFS Tutorial: Rebalancing;

o HDFS Commands Guide: balancer.
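Options 2 and 4 can be sketched with shell commands against a running cluster (the paths, replication factors, and threshold below are illustrative, not prescribed by this FAQ):

```shell
# Option 2: temporarily raise replication so new copies land on the
# new nodes (-R recurses, -w waits until the target replication is met).
bin/hadoop fs -setrep -R -w 5 /user/data

# Then drop back to the normal factor; the name-node deletes the
# excess replicas, chosen across the cluster.
bin/hadoop fs -setrep -R 3 /user/data

# Option 4: run the balancer; -threshold is the allowed deviation
# (in percent of disk usage) between data-nodes.
bin/start-balancer.sh -threshold 10
```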

7. HDFS. What is the purpose of the secondary name-node?

The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event can it replace the primary name-node in case of its failure.

The only purpose of the secondary name-node is to perform periodic checkpoints. The secondary name-node periodically downloads the current name-node image and edits log files, joins them into a new image, and uploads the new image back to the (primary and the only) name-node. See the User Guide.

So if the name-node fails and you can restart it on the same physical node, then there is no need to shut down the data-nodes; just the name-node needs to be restarted. If you cannot use the old node anymore, you will need to copy the latest image somewhere else. The latest image can be found either on the node that used to be the primary before the failure, if available, or on the secondary name-node. The latter will be the latest checkpoint without subsequent edits logs, that is, the most recent namespace modifications may be missing there. You will also need to restart the whole cluster in this case.
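In 0.x-era configurations the checkpoint location and frequency are controlled by parameters such as the following (the path and period are illustrative values, not taken from this FAQ):

```xml
<property>
  <name>fs.checkpoint.dir</name>
  <value>/hadoop/dfs/namesecondary</value>
  <description>Directory where the secondary name-node stores the
  checkpointed image; it can also serve as an image backup.</description>
</property>
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
  <description>Number of seconds between two periodic checkpoints.</description>
</property>
```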

8. MR. What is the DistributedCache used for?

The distributed cache is used to distribute large read-only files that are needed by map/reduce jobs to the cluster. The framework will copy the necessary files from a URL (either hdfs: or http:) onto the slave node before any tasks for the job are executed on that node. The files are only copied once per job and so should not be modified by the application.
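In the 0.x API the files are usually registered from the job driver with DistributedCache.addCacheFile(...), but they can also be listed directly in the configuration; a minimal sketch (the URI is illustrative):

```xml
<property>
  <name>mapred.cache.files</name>
  <value>hdfs://namenode:9000/user/data/lookup.dat#lookup</value>
  <description>Comma-separated list of read-only files distributed to the
  slave nodes; the optional #fragment names a symlink in the task's
  working directory (requires mapred.create.symlink below).</description>
</property>
<property>
  <name>mapred.create.symlink</name>
  <value>yes</value>
</property>
```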

9. MR. Can I create/write-to HDFS files directly from my map/reduce tasks?

Yes. (Clearly, you want this since you need to create/write-to files other than the output file written out by OutputCollector.)

Caveats:

${mapred.output.dir} is the eventual output directory for the job (JobConf.setOutputPath / JobConf.getOutputPath).

${taskid} is the actual id of the individual task-attempt (e.g. task_200709221812_0001_m_000000_0); a TIP is a bunch of ${taskid}s (e.g. task_200709221812_0001_m_000000).

With speculative-execution on, one could face issues with two instances of the same TIP (running simultaneously) trying to open/write-to the same file (path) on HDFS. Hence the app-writer will have to pick unique names (e.g. using the complete task id, i.e. task_200709221812_0001_m_000000_0) per task-attempt, not just per TIP. (Clearly, this needs to be done even if the user doesn't create/write-to files directly via reduce tasks.)

To get around this, the framework helps the application-writer out by maintaining a special ${mapred.output.dir}/_${taskid} sub-dir for each task-attempt on HDFS where the output of the reduce task-attempt goes. On successful completion of the task-attempt, the files in ${mapred.output.dir}/_${taskid} (of the successful task id only) are moved to ${mapred.output.dir}. Of course, the framework discards the sub-directory of unsuccessful task-attempts. This is completely transparent to the application.

The application-writer can take advantage of this by creating any side-files required in ${mapred.output.dir} during execution of his reduce-task, and the framework will move them out similarly; thus you don't have to pick unique paths per task-attempt.

Fine-print:

The value of ${mapred.output.dir} during execution of a particular task-attempt is actually ${mapred.output.dir}/_${taskid}, not the value set by JobConf.setOutputPath. So, just create any HDFS files you want in ${mapred.output.dir} from your reduce task to take advantage of this feature.

The entire discussion holds true for maps of jobs with reducer=NONE (i.e. 0 reduces), since output of the map, in that case, goes directly to HDFS.
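A minimal old-API (org.apache.hadoop.mapred) sketch of writing a side-file from inside a task; the class, method, and file names are illustrative. Because ${mapred.output.dir} already resolves to the per-attempt sub-directory at run time, no unique naming is required:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class SideFiles {
    // Call from within a map or reduce task.
    static void writeSideFile(JobConf conf) throws IOException {
        // At task-execution time this is ${mapred.output.dir}/_${taskid};
        // the framework promotes its contents on task success.
        Path outDir = new Path(conf.get("mapred.output.dir"));
        FileSystem fs = outDir.getFileSystem(conf);
        FSDataOutputStream side = fs.create(new Path(outDir, "side-file.txt"));
        side.writeBytes("extra output\n");
        side.close();
    }
}
```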

10. MR. How do I get each of my maps to work on one complete input-file and not allow the framework to split up my files?

Essentially a job's input is represented by the InputFormat (interface) / FileInputFormat (base class).

For this purpose one would need a 'non-splittable' FileInputFormat, i.e. an input-format which essentially tells the map-reduce framework that it cannot be split up and processed. To do this you need your particular input-format to return false for the isSplitable call.

E.g. org.apache.hadoop.mapred.SortValidator.RecordStatsChecker.NonSplitableSequenceFileInputFormat in src/test/org/apache/hadoop/mapred/SortValidator.java

In addition to implementing the InputFormat interface and having isSplitable(...) return false, it is also necessary to implement the RecordReader interface for returning the whole content of the input file. (The default is LineRecordReader, which splits the file into separate lines.)

The other, quick-fix option is to set mapred.min.split.size to a large enough value.
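A minimal non-splittable input-format under the old API (the class name is illustrative); note that each map then receives one whole file as its split, but records remain single lines unless a custom RecordReader is supplied as described above:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split input files: every map task gets one complete file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}
```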

11. Why do I see broken images in the jobdetails.jsp page?

In hadoop-0.15, Map/Reduce task completion graphics were added. The graphs are produced as SVG (Scalable Vector Graphics) images, which are basically XML files, embedded in HTML content. The graphics were tested successfully in Firefox 2 on Ubuntu and Mac OS. For other browsers, however, one should install an additional browser plugin, such as Adobe's SVG Viewer, to see the SVG images.

12. HDFS. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

No. During safe mode, replication of blocks is prohibited. The name-node waits until all or a majority of the data-nodes report their blocks.

Depending on how the safe mode parameters are configured, the name-node will stay in safe mode until a specific percentage of blocks of the system is minimally replicated, as defined by dfs.replication.min. If the safe mode threshold dfs.safemode.threshold.pct is set to 1, then all blocks of all files should be minimally replicated.

Minimal replication does not mean full replication. Some replicas may be missing, and in order to replicate them the name-node needs to leave safe mode.

Learn more about safe mode here.
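The two parameters mentioned above live in the HDFS configuration; a sketch with illustrative values (a threshold greater than 1 keeps the name-node in safe mode permanently):

```xml
<property>
  <name>dfs.replication.min</name>
  <value>1</value>
  <description>Minimal block replication.</description>
</property>
<property>
  <name>dfs.safemode.threshold.pct</name>
  <value>0.999</value>
  <description>Fraction of blocks that must satisfy dfs.replication.min
  before the name-node leaves safe mode.</description>
</property>
```

Safe mode can also be inspected or left manually with bin/hadoop dfsadmin -safemode get and bin/hadoop dfsadmin -safemode leave.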

13. MR. I see a maximum of 2 maps/reduces spawned concurrently on each TaskTracker; how do I increase that?

Use the configuration knobs mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum to control the number of maps/reduces spawned simultaneously on a TaskTracker. By default it is set to 2, hence one sees a maximum of 2 maps and 2 reduces at a given instance on a TaskTracker.

You can set these on a per-TaskTracker basis to accurately reflect your hardware (i.e. set them to higher values on a beefier TaskTracker, etc.).
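These are per-node settings in each TaskTracker's hadoop-site.xml; a sketch with illustrative values (the TaskTracker must be restarted to pick them up):

```xml
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of map tasks run simultaneously on this TaskTracker.</description>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
  <description>Maximum number of reduce tasks run simultaneously on this TaskTracker.</description>
</property>
```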

14. MR. Submitting map/reduce jobs as a different user doesn't work.

The problem is that you haven't configured your map/reduce system directory to a fixed value. The default works for single-node systems, but not for "real" clusters. I like to use:

<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
  <description>The shared directory where MapReduce stores control files.</description>
</property>

Note that this directory is in your default file system, must be accessible from both the client and server machines, and is typically in HDFS.

15. HDFS. How do I set up a Hadoop node to use multiple volumes?

Data-nodes can store blocks in multiple directories, typically allocated on different local disk drives. In order to set up multiple directories one needs to specify a comma-separated list of pathnames as the value of the configuration parameter dfs.data.dir. Data-nodes will attempt to place equal amounts of data in each of the directories.

The name-node also supports multiple directories, which in this case store the namespace image and the edits log. The directories are specified via the dfs.name.dir configuration parameter. The name-node directories are used for namespace data replication, so that the image and the log can be restored from the remaining volumes if one of them fails.
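A configuration sketch for both cases (all paths are illustrative):

```xml
<property>
  <name>dfs.data.dir</name>
  <value>/disk1/dfs/data,/disk2/dfs/data,/disk3/dfs/data</value>
  <description>Blocks are spread evenly across these local directories.</description>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/dfs/name,/backup/dfs/name</value>
  <description>The namespace image and edits log are written to every
  listed directory, so any one surviving copy can restore the name-node.</description>
</property>
```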

16. HDFS. What happens if one Hadoop client renames a file or a directory containing this file while another client is still writing into it?

Starting with release hadoop-0.15, a file will appear in the namespace as soon as it is created. If a writer is writing to a file and another client renames either the file itself or any of its path components, then the original writer will get an IOException either when it finishes writing to the current block or when it closes the file.
