
Hadoop Archives Explained

The Hadoop har File System in Detail

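Contents

1. Overview of the har file system
1.1 Purpose of the har file system
1.2 Structure and organization of the har file system
1.3 Using the har file system
1.4 Limitations of the har file system
2. Class analysis of the har file system
2.1 The HadoopArchives class
2.1.1 Functionality
2.1.2 Program flow (class-diagram analysis)
2.1.3 Related classes and tools (class-diagram analysis)
2.2 The HarFileSystem class
2.2.1 Functionality
2.2.2 Read-path analysis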

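1. Overview of the har file system

This part introduces har files and the har file system.

1.1 Purpose of the har file system

1.2 Structure and organization of the har file system

1.3 Using the har file system

1.4 Limitations of the har file system

2. Class analysis of the har file system

2.1 The HadoopArchives class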

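The most important member variable of this class is conf. (A class-diagram analysis is to be added here.)

2.1.1 Functionality

The main purpose of the HadoopArchives class is to generate har files. It is a tool class that implements the Tool interface.

Running it amounts to executing a MapReduce job.

Generating a har file involves four stages: the run method, the archive method, the Map phase, and the Reduce phase.

The next section analyzes them in detail.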

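2.1.2 Program flow (class-diagram analysis)

Execution starts in the main function of HadoopArchives, which uses the ToolRunner class to invoke the run method of HadoopArchives.

The run method validates and extracts the command-line arguments, converts them into the input paths and the output directory, and passes these to the archive method.

Here is the run method: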

    /** the main driver for creating the archives
     *  it takes at least two command line parameters. The src and the
     *  dest. It does an lsr on the source paths.
     *  The mapper creates archives and the reducer creates
     *  the archive index.
     */
    public int run(String[] args) throws Exception {
      try {
        List<Path> srcPaths = new ArrayList<Path>();
        Path destPath = null;
        // check we were supposed to archive or
        // unarchive
        String archiveName = null;
        if (args.length < 4) {
          System.out.println(usage);
          throw new IOException("Invalid usage.");
        }
        if (!"-archiveName".equals(args[0])) {
          System.out.println(usage);
          throw new IOException("Archive Name not specified.");
        }
        archiveName = args[1];
        if (!checkValidName(archiveName)) {
          System.out.println(usage);
          throw new IOException("Invalid name for archives. " + archiveName);
        }
        // the last argument is the destination, all others are sources
        for (int i = 2; i < args.length; i++) {
          if (i == (args.length - 1)) {
            destPath = new Path(args[i]);
          }
          else {
            srcPaths.add(new Path(args[i]));
          }
        }
        if (srcPaths.size() == 0) {
          System.out.println(usage);
          throw new IOException("Invalid Usage: No input sources specified.");
        }
        // do a glob on the srcPaths and then pass it on
        List<Path> globPaths = new ArrayList<Path>();
        for (Path p : srcPaths) {
          FileSystem fs = p.getFileSystem(getConf());
          FileStatus[] statuses = fs.globStatus(p);
          for (FileStatus status : statuses) {
            globPaths.add(fs.makeQualified(status.getPath()));
          }
        }
        archive(globPaths, archiveName, destPath);
      } catch (IOException ie) {
        System.err.println(ie.getLocalizedMessage());
        return -1;
      }
      return 0;
    }

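The glob loop at the end of run (highlighted in red in the original document) completes the input paths taken from the command-line arguments: makeQualified adds the scheme and authority so that each path becomes fully qualified. The qualified paths are then passed on to archive.

The main job of run is to extract three variables, globPaths, archiveName and destPath, and pass them to the archive method as arguments. Their meanings are straightforward:

globPaths: all of the input paths;
archiveName: the file name of the har file to be generated;
destPath: the output directory, that is, the directory that will hold the file named archiveName.

Some error checking is performed while these three values are extracted; see the code for details.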
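The argument handling described above can be sketched in plain Java without any Hadoop dependency. In this sketch HarArgs and parse are hypothetical names, paths are kept as plain strings, and the name check (a .har suffix) is an assumption, since the body of checkValidName is not shown in the text.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the argument handling in run(); not part of Hadoop. */
class HarArgs {
    final String archiveName;
    final List<String> srcPaths;
    final String destPath;

    HarArgs(String archiveName, List<String> srcPaths, String destPath) {
        this.archiveName = archiveName;
        this.srcPaths = srcPaths;
        this.destPath = destPath;
    }

    /** Parses arguments of the form: -archiveName NAME src... dest */
    static HarArgs parse(String[] args) {
        if (args.length < 4) {
            throw new IllegalArgumentException("Invalid usage.");
        }
        if (!"-archiveName".equals(args[0])) {
            throw new IllegalArgumentException("Archive name not specified.");
        }
        String name = args[1];
        // Assumption: a valid archive name simply ends in ".har".
        if (!name.endsWith(".har")) {
            throw new IllegalArgumentException("Invalid name for archives: " + name);
        }
        // Everything between the name and the last argument is a source path.
        List<String> srcs = new ArrayList<>();
        for (int i = 2; i < args.length - 1; i++) {
            srcs.add(args[i]);
        }
        // The last argument is the destination directory.
        return new HarArgs(name, srcs, args[args.length - 1]);
    }

    public static void main(String[] args) {
        HarArgs parsed = parse(new String[] {
            "-archiveName", "logs.har", "/user/a/in1", "/user/a/in2", "/user/a/out"});
        System.out.println(parsed.archiveName + " " + parsed.srcPaths + " -> " + parsed.destPath);
    }
}
```

Because at least four arguments are required and the last one is always the destination, a successfully parsed command line always carries at least one source path.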

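Next comes the archive method, which does three things. First, it configures the job parameters. Second, it generates a SequenceFile from the input files; this file holds the directory information for all input files and serves as the input to the Map phase. Third, it launches the job.

Here is the source of the archive method: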

    /** archive the given source paths into
     * the dest
     * @param srcPaths the src paths to be archived
     * @param dest the dest dir that will contain the archive
     */
    public void archive(List<Path> srcPaths, String archiveName, Path dest)
    throws IOException {
      checkPaths(conf, srcPaths);
      int numFiles = 0;
      long totalSize = 0;
      conf.set(DST_HAR_LABEL, archiveName);
      Path outputPath = new Path(dest, archiveName);
      FileOutputFormat.setOutputPath(conf, outputPath);
      FileSystem outFs = outputPath.getFileSystem(conf);
      if (outFs.exists(outputPath) || outFs.isFile(dest)) {
        throw new IOException("Invalid Output.");
      }
      conf.set(DST_DIR_LABEL, outputPath.toString());
      final String randomId = DistCp.getRandomId();
      Path jobDirectory = new Path(new JobClient(conf).getSystemDir(),
                            NAME + "_" + randomId);
      conf.set(JOB_DIR_LABEL, jobDirectory.toString());
      // get a tmp directory for input splits
      FileSystem jobfs = jobDirectory.getFileSystem(conf);
      jobfs.mkdirs(jobDirectory);
      Path srcFiles = new Path(jobDirectory, "_har_src_files");
      conf.set(SRC_LIST_LABEL, srcFiles.toString());
      SequenceFile.Writer srcWriter = SequenceFile.createWriter(jobfs, conf,
          srcFiles, LongWritable.class, Text.class,
          SequenceFile.CompressionType.NONE);
      // get the list of files
      // create single list of files and dirs
      try {
        // write the top level dirs in first
        writeTopLevelDirs(srcWriter, srcPaths);
        srcWriter.sync();
        // these are the input paths passed
        // from the command line
        // we do a recursive ls on these paths
        // and then write them to the input file
        // one at a time
        for (Path src : srcPaths) {
          FileSystem fs = src.getFileSystem(conf);
          ArrayList<FileStatus> allFiles = new ArrayList<FileStatus>();
          recursivels(fs, src, allFiles);
          for (FileStatus stat : allFiles) {
            String toWrite = "";
            // directories are recorded with length 0
            long len = stat.isDir() ? 0 : stat.getLen();
            if (stat.isDir()) {
              toWrite = "" + fs.makeQualified(stat.getPath()) + " dir ";
              // get the children
              FileStatus[] list = fs.listStatus(stat.getPath());
              StringBuffer sbuff = new StringBuffer();
              sbuff.append(toWrite);
              for (FileStatus stats : list) {
                sbuff.append(stats.getPath().getName() + " ");
              }
              toWrite = sbuff.toString();
            }
            else {
              toWrite += fs.makeQualified(stat.getPath()) + " file ";
            }
            srcWriter.append(new LongWritable(len), new Text(toWrite));
            srcWriter.sync();
            numFiles++;
            totalSize += len;
          }
        }
      } finally {
        srcWriter.close();
      }
      // increase the replication of src files
      jobfs.setReplication(srcFiles, (short) 10);
      conf.setInt(SRC_COUNT_LABEL, numFiles);
      conf.setLong(TOTAL_SIZE_LABEL, totalSize);
      int numMaps = (int) (totalSize / partSize);
      // run at least one map.
      conf.setNumMapTasks(numMaps == 0 ? 1 : numMaps);
      conf.setNumReduceTasks(1);
      conf.setInputFormat(HArchiveInputFormat.class);
      conf.setOutputFormat(NullOutputFormat.class);
      conf.setMapperClass(HArchivesMapper.class);
      conf.setReducerClass(HArchivesReducer.class);
      conf.setMapOutputKeyClass(IntWritable.class);
      conf.setMapOutputValueClass(Text.class);
      conf.set("hadoop.job.history.user.location", "none");
      FileInputFormat.addInputPath(conf, jobDirectory);
      // make sure no speculative execution is done
      conf.setSpeculativeExecution(false);
      JobClient.runJob(conf);
      // delete the tmp job directory
      try {
        jobfs.delete(jobDirectory, true);
      } catch (IOException ie) {
        LOG.info("Unable to clean tmp directory " + jobDirectory);
      }
    }

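The highlighted call (marked in red in the original document)

    fs.makeQualified(stat.getPath())

completes the path into a fully qualified path.

This is probably necessary because the subdirectories, and the files inside them, are obtained through the getPath method, which may return only the directory part of the path. The Path and URI classes are worth consulting on this point.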
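The effect of qualifying a path can be illustrated with java.net.URI alone. The sketch below is not Hadoop's implementation of makeQualified; QualifySketch, qualify, and the namenode address are hypothetical, and only absolute paths are considered (Hadoop additionally resolves relative paths against a working directory, which is left out here).

```java
import java.net.URI;

/** Hypothetical illustration of path qualification; not Hadoop code. */
class QualifySketch {
    /**
     * Returns the path unchanged if it already has a scheme; otherwise fills
     * in the scheme and authority of the default file system.
     */
    static URI qualify(URI defaultFs, String path) {
        URI p = URI.create(path);
        if (p.getScheme() != null) {
            return p; // already fully qualified
        }
        // resolve() keeps the absolute path and borrows scheme and authority
        // from the base URI.
        return defaultFs.resolve(p);
    }

    public static void main(String[] args) {
        URI defaultFs = URI.create("hdfs://namenode:8020/"); // assumed address
        System.out.println(qualify(defaultFs, "/user/a/b.txt"));
    }
}
```

Writing qualified paths into the file list means that a map task can later open each source on the correct file system without relying on its own default.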

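The method first checks the input paths and then sets a number of parameters. These fall into two groups: values stored in conf, and the input and output paths of the MapReduce job.

Most of the conf settings are the same as in any other MapReduce job and are not discussed here. A few of them, however, are specific to this job and worth pointing out:

    conf.set(DST_HAR_LABEL, archiveName);
    conf.set(DST_DIR_LABEL, outputPath.toString());
    conf.set(JOB_DIR_LABEL, jobDirectory.toString());
    conf.set(SRC_LIST_LABEL, srcFiles.toString());
    conf.setInt(SRC_COUNT_LABEL, numFiles);
    conf.setLong(TOTAL_SIZE_LABEL, totalSize);

Their meaning is fairly clear from the code. They are set because later parts of the program read them back, for example when the input splits are computed; we return to this later.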
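The hand-off through the configuration can be modelled with java.util.Properties. In this sketch the key strings are hypothetical (the text shows only the constant names), ConfHandoffSketch is an invented class, and partSize is passed in as a parameter because its value is not given in the text.

```java
import java.util.Properties;

/** Hypothetical model of how the driver passes values to later stages via conf. */
class ConfHandoffSketch {
    // Assumed key strings, for illustration only.
    static final String SRC_LIST_LABEL = "example.har.src.list";
    static final String TOTAL_SIZE_LABEL = "example.har.total.size";

    /** Driver side: record where the file list lives and how many bytes it describes. */
    static Properties configure(String srcListPath, long totalSize) {
        Properties conf = new Properties();
        conf.setProperty(SRC_LIST_LABEL, srcListPath);
        conf.setProperty(TOTAL_SIZE_LABEL, Long.toString(totalSize));
        return conf;
    }

    /** Consumer side: fail fast when the driver did not set the value. */
    static long readTotalSize(Properties conf) {
        long totalSize = Long.parseLong(conf.getProperty(TOTAL_SIZE_LABEL, "-1"));
        if (totalSize == -1) {
            throw new IllegalStateException("Invalid size of files to archive");
        }
        return totalSize;
    }

    /** One map task per partSize bytes of input, but never fewer than one. */
    static int numMaps(long totalSize, long partSize) {
        int n = (int) (totalSize / partSize);
        return n == 0 ? 1 : n;
    }
}
```

The sentinel default (-1) mirrors the check in getSplits: a missing value is treated as a configuration error rather than silently replaced.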

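    FileOutputFormat.setOutputPath(conf, outputPath);
    FileInputFormat.addInputPath(conf, jobDirectory);
    ...

The rest of the MapReduce job configuration is skipped here.

Next, let us see how the input file for the Map phase is generated. It is stored at srcFiles:

    Path srcFiles = new Path(jobDirectory, "_har_src_files");

Each record in it has the following format:

file size (0 for a directory) + path name + "dir" or "file" + [child entries] (present only when the entry is a directory that has children; each child is given by its name within that directory, not by its full path)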
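The record layout can be sketched as a pair of helper functions. SrcListRecord is a hypothetical class, and the single-space separators are an assumption: the listing in this document has lost its whitespace, so the exact delimiters cannot be read off it.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical builder for one record of the source list file. */
class SrcListRecord {
    /** Key of the record: the byte length, with directories counted as 0. */
    static long key(boolean isDir, long length) {
        return isDir ? 0L : length;
    }

    /**
     * Value of the record: the qualified path, the entry type, and for a
     * directory the names of its immediate children.
     */
    static String value(String qualifiedPath, boolean isDir, List<String> childNames) {
        StringBuilder sb = new StringBuilder(qualifiedPath);
        if (isDir) {
            sb.append(" dir ");
            for (String child : childNames) {
                sb.append(child).append(' '); // names only, relative to the directory
            }
        } else {
            sb.append(" file ");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(key(true, 4096) + "\t"
            + value("hdfs://nn/a", true, Arrays.asList("b", "c.txt")));
        System.out.println(key(false, 120) + "\t"
            + value("hdfs://nn/a/c.txt", false, Arrays.<String>asList()));
    }
}
```

Since directories contribute 0 to the key, the sum of all keys equals the number of data bytes to be copied, which is the quantity the job later uses to balance its map tasks.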

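It is a SequenceFile, and two methods are chiefly responsible for producing it: writeTopLevelDirs and recursivels.

writeTopLevelDirs extracts every level of the parent directories of all input files and directories; for example, for /a/b/c/d.txt it extracts /, /a/, /a/b/ and /a/b/c/. recursivels recursively collects all files and directories under the given directory, including the directory itself; if the given path is a file, it collects only that file.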
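The ancestor extraction attributed to writeTopLevelDirs can be sketched for a single absolute path as follows. TopLevelDirs and ancestors are hypothetical names, and the sketch is a simplification: it handles one path at a time and does not record the children of each directory.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of collecting the parent directories of a path. */
class TopLevelDirs {
    /** Returns every ancestor directory of an absolute path, root first. */
    static List<String> ancestors(String absolutePath) {
        List<String> dirs = new ArrayList<>();
        dirs.add("/");
        String[] parts = absolutePath.split("/");
        StringBuilder current = new StringBuilder();
        // parts[0] is empty because the path starts with "/"; the last part
        // is the file or directory itself, so it is skipped.
        for (int i = 1; i < parts.length - 1; i++) {
            current.append('/').append(parts[i]);
            dirs.add(current.toString());
        }
        return dirs;
    }

    public static void main(String[] args) {
        // prints [/, /a, /a/b, /a/b/c]
        System.out.println(ancestors("/a/b/c/d.txt"));
    }
}
```

With several input paths, the same ancestors would appear repeatedly, so a real implementation has to merge them before writing; that step is omitted here.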

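Once the input file has been written and all configuration is complete, the method launches a MapReduce job; when the job finishes, the method deletes the working directory.

Before analyzing the Map phase, let us look at the input format of this MapReduce job:

    conf.setInputFormat(HArchiveInputFormat.class);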

    /**
     * Input format of a hadoop archive job responsible for
     * generating splits of the file list
     */
    static class HArchiveInputFormat implements InputFormat<LongWritable, Text> {

      // generate input splits from the src file lists
      public InputSplit[] getSplits(JobConf jconf, int numSplits)
      throws IOException {
        String srcfilelist = jconf.get(SRC_LIST_LABEL, "");
        if ("".equals(srcfilelist)) {
          throw new IOException("Unable to get the " +
              "src file for archive generation.");
        }
        long totalSize = jconf.getLong(TOTAL_SIZE_LABEL, -1);
        if (totalSize == -1) {
          throw new IOException("Invalid size of files to archive");
        }
        // we should be safe since this is set by our own code
        Path src = new Path(srcfilelist);
        FileSystem fs = src.getFileSystem(jconf);
        FileStatus fstatus = fs.getFileStatus(src);
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        LongWritable key = new LongWritable();
        Text value = new Text();
        SequenceFile.Reader reader = null;
        // the remaining bytes in the file split
        long remaining = fstatus.getLen();
        // the count of sizes calculated till now
        long currentCount = 0L;
        // the end position of the split
        long lastPos = 0L;
        // the start position of the split
        long startPos = 0L;
        long targetSize = totalSize / numSplits;
        // create splits of size target size so that all the maps
        // have equal sized data to read and write to.
        try {
          reader = new SequenceFile.Reader(fs, src, jconf);
          while (reader.next(key, value)) {
            // close the current split once adding this record would exceed the target
            if (currentCount + key.get() > targetSize && currentCount != 0) {
              long size = lastPos - startPos;
              splits.add(new FileSplit(src, startPos, size, (String[]) null));
              remaining = remaining - size;
              startPos = lastPos;
              currentCount = 0L;
            }
            currentCount += key.get();
            lastPos = reader.getPosition();
          }
          // the remaining not equal to the target size.
          if (remaining != 0) {
            splits.add(new FileSplit(src, startPos, remaining, (String[]) null));
          }
        }
        finally {
          reader.close();
        }
        return splits.toArray(new FileSplit[splits.size()]);
      }

      public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
          JobConf job, Reporter reporter) throws IOException {
        return new SequenceFileRecordReader<LongWritable, Text>(job,
            (FileSplit) split);
      }
    }

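The main purpose of this input format is to compute the input splits; reading the records is delegated to SequenceFileRecordReader.

Let us now look at how the splits are generated.

The input file is a SequenceFile, so it is splittable.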
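The grouping loop in getSplits can be modelled without Hadoop by replacing the SequenceFile reader with two arrays. In this sketch SplitSketch and its parameters are hypothetical: sizes[i] stands for the key of record i, endPos[i] for the reader position just after that record, and each split is returned as a {start, length} pair over the list file.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, Hadoop-free model of the split computation. */
class SplitSketch {
    static List<long[]> splits(long[] sizes, long[] endPos, long fileLen,
                               long totalSize, int numSplits) {
        long targetSize = totalSize / numSplits;
        List<long[]> out = new ArrayList<>();
        long remaining = fileLen;   // bytes of the list file not yet assigned
        long currentCount = 0L;     // source bytes accumulated in the open split
        long lastPos = 0L;          // end of the last record read
        long startPos = 0L;         // start of the open split
        for (int i = 0; i < sizes.length; i++) {
            // Close the open split if this record would push it past the
            // target, unless the split is still empty.
            if (currentCount + sizes[i] > targetSize && currentCount != 0) {
                long size = lastPos - startPos;
                out.add(new long[] {startPos, size});
                remaining -= size;
                startPos = lastPos;
                currentCount = 0L;
            }
            currentCount += sizes[i];
            lastPos = endPos[i];
        }
        // Whatever is left forms the final split.
        if (remaining != 0) {
            out.add(new long[] {startPos, remaining});
        }
        return out;
    }

    public static void main(String[] args) {
        List<long[]> result = splits(new long[] {0, 60, 50, 40, 10},
            new long[] {100, 200, 300, 400, 500}, 500, 160, 2);
        for (long[] s : result) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

Note that splits are balanced by the size of the source files the records describe, not by the number of bytes of the list file itself, and that the requested number of splits is only a target: in the example above two splits are requested but three are produced.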

First, it reads the path of the input file from the job configuration, where it was set by the archive method; at that time ...
