HadoopArchives in Depth: the Hadoop har File System

Contents
1. Overview of the har file system
1.1 Purpose of the har file system
1.2 Structure and organization of the har file system
1.3 How to use the har file system
1.4 Shortcomings of the har file system
2. Class analysis of the har file system
2.1 Analysis of the HadoopArchives class
2.1.1 Functionality
2.1.2 Program flow (class diagram analysis)
2.1.3 Related classes and tools (class diagram analysis)
2.2 Analysis of the HarFileSystem class
2.2.1 Functionality
2.2.2 Analysis of the file-read path

1. Overview of the har file system
This chapter introduces har files and their file system.

1.1 Purpose of the har file system

1.2 Structure and organization of the har file system

1.3 How to use the har file system

1.4 Shortcomings of the har file system

2. Class analysis of the har file system

2.1 Analysis of the HadoopArchives class

The most important member variable of this class is conf. (A class diagram analysis is to be added.)

2.1.1 Functionality

The main purpose of the HadoopArchives class is to generate har files. It is a utility class that implements the Tool interface, and its execution is in fact the execution of a MapReduce job. Generating a har file involves four main stages: the run method, the archive method, the Map phase, and the Reduce phase. These are analyzed in the next section.

2.1.2 Program flow (class diagram analysis)

Execution starts in HadoopArchives's main function, which uses the ToolRunner class to invoke HadoopArchives's run method. The run method validates and extracts the command-line arguments, converts them into input paths and an output path, and passes these to the archive method. Here is the run method:

    /** the main driver for creating the archives
     *  it takes at least two command line parameters. The src and the
     *  dest. It does an lsr on the source paths.
     *  The mapper creates archives and the reducer creates
     *  the archive index.
     */
    public int run(String[] args) throws Exception {
      try {
        List<Path> srcPaths = new ArrayList<Path>();
        Path destPath = null;
        // check we were supposed to archive or unarchive
        String archiveName = null;
        if (args.length < 4) {
          System.out.println(usage);
          throw new IOException("Invalid usage.");
        }
        if (!"-archiveName".equals(args[0])) {
          System.out.println(usage);
          throw new IOException("Archive Name not specified.");
        }
        archiveName = args[1];
        if (!checkValidName(archiveName)) {
          System.out.println(usage);
          throw new IOException("Invalid name for archives. " + archiveName);
        }
        for (int i = 2; i < args.length; i++) {
          if (i == (args.length - 1)) {
            destPath = new Path(args[i]);
          } else {
            srcPaths.add(new Path(args[i]));
          }
        }
        if (srcPaths.size() == 0) {
          System.out.println(usage);
          throw new IOException("Invalid Usage: No input sources specified.");
        }
        // do a glob on the srcPaths and then pass it on
        List<Path> globPaths = new ArrayList<Path>();
        for (Path p : srcPaths) {
          FileSystem fs = p.getFileSystem(getConf());
          FileStatus[] statuses = fs.globStatus(p);
          for (FileStatus status : statuses) {
            globPaths.add(fs.makeQualified(status.getPath()));
          }
        }
        archive(globPaths, archiveName, destPath);
      } catch (IOException ie) {
        System.err.println(ie.getLocalizedMessage());
        return -1;
      }
      return 0;
    }

The call fs.makeQualified(status.getPath()) completes each input path (derived from the command-line arguments) by adding a scheme and authority, so that a fully qualified path is passed on. In short, this method extracts three values and passes them to the archive method: globPaths, all the input paths after glob expansion; archiveName, the name of the har file to be generated; and destPath, the output directory that will contain the file named archiveName. While extracting these three values it performs the error checks visible in the code.

Next we look at the archive method.
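The argument handling in run can be isolated into a small standalone sketch with no Hadoop dependency. The class name ArgSketch and the use of plain strings instead of Path objects are assumptions for illustration; the parsing rule is the one shown above: args[0] must be -archiveName, args[1] is the archive name, the last positional argument is the destination, and everything in between is a source.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone sketch of run()'s argument handling (hypothetical class, no Hadoop):
// usage: -archiveName <name> <src>* <dest>
class ArgSketch {
    static String archiveName;
    static List<String> srcPaths = new ArrayList<>();
    static String destPath;

    static void parse(String[] args) {
        srcPaths = new ArrayList<>();          // reset for repeated calls
        if (args.length < 4 || !"-archiveName".equals(args[0])) {
            throw new IllegalArgumentException("usage: -archiveName <name> <src>* <dest>");
        }
        archiveName = args[1];
        for (int i = 2; i < args.length; i++) {
            if (i == args.length - 1) destPath = args[i]; // last arg: destination
            else srcPaths.add(args[i]);                   // the rest: sources
        }
    }

    public static void main(String[] a) {
        parse(new String[] {"-archiveName", "foo.har", "/user/a", "/user/b", "/out"});
        // prints: foo.har [/user/a, /user/b] /out
        System.out.println(archiveName + " " + srcPaths + " " + destPath);
    }
}
```

In the real class the sources are additionally glob-expanded and qualified with scheme and authority before being handed to archive.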
The archive method does three things: first, it configures the relevant job parameters; second, it builds a SequenceFile from the input files, recording the directory and file information of every input, which will serve as the Map input; finally, it launches the job. Here is the source of the archive method:

    /** archive the given source paths into
     * the dest
     * @param srcPaths the src paths to be archived
     * @param dest the dest dir that will contain the archive
     */
    public void archive(List<Path> srcPaths, String archiveName, Path dest)
        throws IOException {
      checkPaths(conf, srcPaths);
      int numFiles = 0;
      long totalSize = 0;
      conf.set(DST_HAR_LABEL, archiveName);
      Path outputPath = new Path(dest, archiveName);
      FileOutputFormat.setOutputPath(conf, outputPath);
      FileSystem outFs = outputPath.getFileSystem(conf);
      if (outFs.exists(outputPath) || outFs.isFile(dest)) {
        throw new IOException("Invalid Output.");
      }
      conf.set(DST_DIR_LABEL, outputPath.toString());
      final String randomId = DistCp.getRandomId();
      Path jobDirectory = new Path(new JobClient(conf).getSystemDir(),
                                   NAME + "_" + randomId);
      conf.set(JOB_DIR_LABEL, jobDirectory.toString());
      // get a tmp directory for input splits
      FileSystem jobfs = jobDirectory.getFileSystem(conf);
      jobfs.mkdirs(jobDirectory);
      Path srcFiles = new Path(jobDirectory, "_har_src_files");
      conf.set(SRC_LIST_LABEL, srcFiles.toString());
      SequenceFile.Writer srcWriter = SequenceFile.createWriter(jobfs, conf,
          srcFiles, LongWritable.class, Text.class,
          SequenceFile.CompressionType.NONE);
      // get the list of files
      // create single list of files and dirs
      try {
        // write the top level dirs in first
        writeTopLevelDirs(srcWriter, srcPaths);
        srcWriter.sync();
        // these are the input paths passed
        // from the command line
        // we do a recursive ls on these paths
        // and then write them to the input file
        // one at a time
        for (Path src : srcPaths) {
          FileSystem fs = src.getFileSystem(conf);
          ArrayList<FileStatus> allFiles = new ArrayList<FileStatus>();
          recursivels(fs, src, allFiles);
          for (FileStatus stat : allFiles) {
            String toWrite = "";
            long len = stat.isDir() ? 0 : stat.getLen();
            if (stat.isDir()) {
              toWrite = "" + fs.makeQualified(stat.getPath()) + " dir ";
              // get the children
              FileStatus[] list = fs.listStatus(stat.getPath());
              StringBuffer sbuff = new StringBuffer();
              sbuff.append(toWrite);
              for (FileStatus stats : list) {
                sbuff.append(stats.getPath().getName() + " ");
              }
              toWrite = sbuff.toString();
            } else {
              toWrite += fs.makeQualified(stat.getPath()) + " file ";
            }
            srcWriter.append(new LongWritable(len), new Text(toWrite));
            srcWriter.sync();
            numFiles++;
            totalSize += len;
          }
        }
      } finally {
        srcWriter.close();
      }
      // increase the replication of src files
      jobfs.setReplication(srcFiles, (short) 10);
      conf.setInt(SRC_COUNT_LABEL, numFiles);
      conf.setLong(TOTAL_SIZE_LABEL, totalSize);
      int numMaps = (int) (totalSize / partSize);
      // run at least one map.
      conf.setNumMapTasks(numMaps == 0 ? 1 : numMaps);
      conf.setNumReduceTasks(1);
      conf.setInputFormat(HArchiveInputFormat.class);
      conf.setOutputFormat(NullOutputFormat.class);
      conf.setMapperClass(HArchivesMapper.class);
      conf.setReducerClass(HArchivesReducer.class);
      conf.setMapOutputKeyClass(IntWritable.class);
      conf.setMapOutputValueClass(Text.class);
      conf.set("hadoop.job.history.user.location", "none");
      FileInputFormat.addInputPath(conf, jobDirectory);
      // make sure no speculative execution is done
      conf.setSpeculativeExecution(false);
      JobClient.runJob(conf);
      // delete the tmp job directory
      try {
        jobfs.delete(jobDirectory, true);
      } catch (IOException ie) {
        LOG.info("Unable to clean tmp directory " + jobDirectory);
      }
    }

Here again, fs.makeQualified(stat.getPath()) completes the path into a fully qualified one. This is probably because the children of a directory are later retrieved via getPath, which by itself would not yield a complete path; see the Path and URI classes for details.
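The "run at least one map" rule near the end of archive can be sketched on its own. partSize is a member of HadoopArchives; the 2 GB constant below is an assumption for illustration (the actual default may differ between versions), and the class name NumMapsSketch is hypothetical.

```java
// Standalone sketch of the map-count rule in archive():
// one map per partSize bytes of input, but always at least one map.
class NumMapsSketch {
    // Assumed part size (2 GB); the real value comes from HadoopArchives.partSize.
    static final long PART_SIZE = 2L * 1024 * 1024 * 1024;

    static int numMaps(long totalSize) {
        int n = (int) (totalSize / PART_SIZE);
        return n == 0 ? 1 : n;   // run at least one map
    }

    public static void main(String[] args) {
        System.out.println(numMaps(0L));                      // empty input -> 1
        System.out.println(numMaps(5L * 1024 * 1024 * 1024)); // 5 GB -> 2 (integer division)
    }
}
```

The integer division means a 5 GB input with a 2 GB part size yields 2 maps, not 3; the leftover bytes are absorbed by the existing splits.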
The method first checks the input paths, then configures the parameters. The configuration falls into two kinds: settings on conf, and the input/output path settings of the MapReduce job. Among the conf settings, most follow the usual MapReduce configuration pattern, which we skip; a few parameters, however, deserve attention:

    conf.set(DST_HAR_LABEL, archiveName);
    conf.set(DST_DIR_LABEL, outputPath.toString());
    conf.set(JOB_DIR_LABEL, jobDirectory.toString());
    conf.set(SRC_LIST_LABEL, srcFiles.toString());
    conf.setInt(SRC_COUNT_LABEL, numFiles);
    conf.setLong(TOTAL_SIZE_LABEL, totalSize);

Their meanings are fairly clear from the code. They are set because later stages of the program read them back; for example, the source file list and total size are needed when computing the input splits, as we will see. The job's path settings are:

    FileOutputFormat.setOutputPath(conf, outputPath);
    FileInputFormat.addInputPath(conf, jobDirectory);

The rest of the MapReduce job configuration is omitted here. Next, consider how the Map input file is generated. It is written to srcFiles:

    Path srcFiles = new Path(jobDirectory, "_har_src_files");

Each record in it has the format:

    file length (0 for a directory) + full path + "dir" or "file" + child names (only for a directory that has children; the children are bare names relative to that directory, not full paths)

The file is a SequenceFile, and it is produced mainly by two methods: writeTopLevelDirs and recursivels. writeTopLevelDirs writes an entry for every ancestor directory level of the input files and directories; for /a/b/c/d.txt it writes /, /a/, /a/b/, and /a/b/c/. recursivels recursively collects the information of all files and directories under the given path, including the path itself; if the path is a file, only that file's information is collected. Once the input file has been written and the configuration is complete, the method launches the MapReduce job, and deletes the job directory when the job finishes.
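The ancestor enumeration that writeTopLevelDirs performs can be illustrated without Hadoop. The sketch below (class name TopLevelDirsSketch and the string-based path handling are assumptions for illustration, not the Hadoop implementation) reproduces the example from the text: /a/b/c/d.txt yields /, /a, /a/b, /a/b/c.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the idea behind writeTopLevelDirs:
// emit every ancestor directory level of an absolute path, root first.
class TopLevelDirsSketch {
    static Set<String> ancestors(String path) {
        Set<String> out = new LinkedHashSet<>();
        int idx = path.lastIndexOf('/');
        String parent = (idx == 0) ? "/" : path.substring(0, idx);
        // walk upward from the immediate parent to the root
        ArrayList<String> up = new ArrayList<>();
        while (true) {
            up.add(parent);
            if (parent.equals("/")) break;
            int j = parent.lastIndexOf('/');
            parent = (j == 0) ? "/" : parent.substring(0, j);
        }
        for (int i = up.size() - 1; i >= 0; i--) out.add(up.get(i)); // root first
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ancestors("/a/b/c/d.txt")); // prints: [/, /a, /a/b, /a/b/c]
    }
}
```

Writing these ancestor entries first means the archive index can represent the full directory tree even though the recursive listing only starts at each input path.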
Before analyzing the Map phase, let us look at the input format of this MapReduce job, set by conf.setInputFormat(HArchiveInputFormat.class):

    /**
     * Input format of a hadoop archive job responsible for
     * generating splits of the file list
     */
    static class HArchiveInputFormat implements InputFormat<LongWritable, Text> {

      // generate input splits from the src file lists
      public InputSplit[] getSplits(JobConf jconf, int numSplits)
          throws IOException {
        String srcfilelist = jconf.get(SRC_LIST_LABEL, "");
        if ("".equals(srcfilelist)) {
          throw new IOException("Unable to get the " +
              "src file for archive generation.");
        }
        long totalSize = jconf.getLong(TOTAL_SIZE_LABEL, -1);
        if (totalSize == -1) {
          throw new IOException("Invalid size of files to archive");
        }
        // we should be safe since this is set by our own code
        Path src = new Path(srcfilelist);
        FileSystem fs = src.getFileSystem(jconf);
        FileStatus fstatus = fs.getFileStatus(src);
        ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
        LongWritable key = new LongWritable();
        Text value = new Text();
        SequenceFile.Reader reader = null;
        // the remaining bytes in the file split
        long remaining = fstatus.getLen();
        // the count of sizes calculated till now
        long currentCount = 0L;
        // the end position of the split
        long lastPos = 0L;
        // the start position of the split
        long startPos = 0L;
        long targetSize = totalSize / numSplits;
        // create splits of size target size so that all the maps
        // have equals sized data to read and write to.
        try {
          reader = new SequenceFile.Reader(fs, src, jconf);
          while (reader.next(key, value)) {
            if (currentCount + key.get() > targetSize && currentCount != 0) {
              long size = lastPos - startPos;
              splits.add(new FileSplit(src, startPos, size, (String[]) null));
              remaining = remaining - size;
              startPos = lastPos;
              currentCount = 0L;
            }
            currentCount += key.get();
            lastPos = reader.getPosition();
          }
          // the remaining not equal to the target size.
          if (remaining != 0) {
            splits.add(new FileSplit(src, startPos, remaining, (String[]) null));
          }
        } finally {
          reader.close();
        }
        return splits.toArray(new FileSplit[splits.size()]);
      }

      public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
          JobConf job, Reporter reporter) throws IOException {
        return new SequenceFileRecordReader<LongWritable, Text>(job, (FileSplit) split);
      }
    }

This input format exists mainly to produce the input splits; reading the records is delegated to SequenceFileRecordReader. Let us see how it generates the splits. The input file is in SequenceFile format, so it is splittable. getSplits first reads the path of the source file list from the configuration (SRC_LIST_LABEL), which was set earlier in the archive method.
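The grouping loop in getSplits can be modeled without Hadoop: accumulate per-record sizes until adding the next record would exceed targetSize = totalSize / numSplits, then cut a split and start a new one, with any tail forming a final split. The sketch below (class name SplitSketch is hypothetical) counts records per split instead of tracking SequenceFile byte offsets, which is a simplification of the real code.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of HArchiveInputFormat.getSplits: group consecutive records
// into splits of roughly targetSize bytes. Each split is represented here by
// the number of records it contains, not by byte offsets into a SequenceFile.
class SplitSketch {
    static List<Integer> group(long[] sizes, int numSplits) {
        long totalSize = 0;
        for (long s : sizes) totalSize += s;
        long targetSize = totalSize / numSplits;
        List<Integer> splits = new ArrayList<>();
        long currentCount = 0;   // bytes accumulated in the current split
        int recordsInSplit = 0;
        for (long s : sizes) {
            // cut the split before a record that would push it past the target,
            // but never emit an empty split (currentCount != 0 guard)
            if (currentCount + s > targetSize && currentCount != 0) {
                splits.add(recordsInSplit);
                recordsInSplit = 0;
                currentCount = 0;
            }
            currentCount += s;
            recordsInSplit++;
        }
        if (recordsInSplit != 0) splits.add(recordsInSplit); // the tail split
        return splits;
    }

    public static void main(String[] args) {
        // 6 records of 100 bytes, 3 splits -> target 200 bytes per split
        System.out.println(group(new long[] {100, 100, 100, 100, 100, 100}, 3)); // [2, 2, 2]
    }
}
```

Note the currentCount != 0 guard: a single record larger than targetSize still goes into a split of its own rather than producing an empty split, mirroring the real implementation's behavior.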