MapReduce: a plain-language analysis and summary (blog notes)


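Typically, both the input and the output of a job are stored in a file system.

The framework takes care of scheduling tasks, monitoring them, and re-executing tasks that have failed.

Typically, the Map/Reduce framework and the distributed file system run on the same set of nodes; that is, the compute nodes and the storage nodes are usually the same machines.

This configuration lets the framework schedule tasks on the nodes where the data already resides, so the aggregate network bandwidth of the cluster is used very efficiently.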
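To make the data-locality idea concrete, here is a minimal sketch in plain Java. It is not Hadoop's scheduler: the class LocalityScheduler, the method pickNode, and the block and node names are all hypothetical, and the only assumption is that the scheduler knows which nodes store each input block.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: prefer an idle node that already stores the input block.
class LocalityScheduler {

    // blockLocations: block id -> nodes holding a replica (assumed to come from the file system).
    // idleNodes: nodes that currently have a free task slot.
    static String pickNode(String blockId,
                           Map<String, Set<String>> blockLocations,
                           List<String> idleNodes) {
        Set<String> holders = blockLocations.getOrDefault(blockId, Set.of());
        for (String node : idleNodes) {
            if (holders.contains(node)) {
                return node; // data-local: the input does not cross the network
            }
        }
        // No idle node holds the block: fall back to any idle node (remote read),
        // or null when nothing is idle.
        return idleNodes.isEmpty() ? null : idleNodes.get(0);
    }

    public static void main(String[] args) {
        Map<String, Set<String>> locations = Map.of(
                "block-1", Set.of("node-a", "node-b"),
                "block-2", Set.of("node-c"));
        List<String> idle = List.of("node-b", "node-c");
        System.out.println(pickNode("block-1", locations, idle)); // node-b
        System.out.println(pickNode("block-2", locations, idle)); // node-c
    }
}
```

When a task runs on a node that already holds its block, only the much smaller map output has to travel over the network.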

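The Map/Reduce framework consists of a single master, the JobTracker, and one slave TaskTracker per cluster node.

The master schedules all the tasks that make up a job onto the slaves, monitors their execution, and re-executes the tasks that fail.

The slaves only execute the tasks assigned to them by the master.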
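The master's "schedule, monitor, re-execute" loop can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions, not the JobTracker's code: RetryingMaster, runAll, and the maxAttempts limit are hypothetical names, and a BiPredicate stands in for a slave that runs a task and reports success or failure.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical sketch of a master that re-queues failed tasks.
class RetryingMaster {

    // worker.test(task, attempt) stands in for a slave executing the task;
    // it returns true when the attempt succeeds.
    static Map<String, Integer> runAll(List<String> tasks, int maxAttempts,
                                       BiPredicate<String, Integer> worker) {
        Map<String, Integer> attemptsUsed = new LinkedHashMap<>();
        Deque<String> pending = new ArrayDeque<>(tasks);
        while (!pending.isEmpty()) {
            String task = pending.poll();
            int attempt = attemptsUsed.merge(task, 1, Integer::sum);
            if (!worker.test(task, attempt)) {
                if (attempt >= maxAttempts) {
                    throw new IllegalStateException(task + " failed " + attempt + " times");
                }
                pending.add(task); // failed: schedule the task again
            }
        }
        return attemptsUsed; // task id -> number of attempts it needed
    }

    public static void main(String[] args) {
        // Simulated slave: the first attempt of task m1 fails, everything else succeeds.
        BiPredicate<String, Integer> worker =
                (task, attempt) -> !(task.equals("m1") && attempt == 1);
        System.out.println(runAll(List.of("m0", "m1", "r0"), 3, worker));
        // prints {m0=1, m1=2, r0=1}
    }
}
```

The slave side stays trivial: it runs what it is given and reports back, which matches the division of labour described above.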

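An application must at least specify the input and output locations (paths) and supply the map and reduce functions by implementing the appropriate interfaces or abstract classes.

Together with the other job parameters, these make up the job configuration.

The Hadoop job client then submits the job (a jar file, an executable, and so on) and its configuration to the JobTracker, which distributes the software and configuration to the slaves, schedules the tasks and monitors them, and reports status and diagnostic information back to the job client.

Although the Hadoop framework is implemented in Java, Map/Reduce applications do not have to be written in Java.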

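2. Example analysis: word count

1. WordCount source code analysis

Word count is one of the simplest programs, and one of those that best illustrate the MapReduce idea. The complete code can be found in the src/examples directory of the Hadoop distribution.

The job of word count is to count how many times each word occurs in a set of text files, as shown in the figure: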

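(1) The Map phase

The Map phase requires extending the Mapper class from the org.apache.hadoop.mapreduce package and overriding its map method.

Adding two statements to map that print the key and the value to the console shows that value holds one line of the text file (a line ends at the line terminator), and key is the offset of the first character of that line from the start of the file.

The StringTokenizer class then splits each line into words, and map emits <word,1> for each of them; the MapReduce framework handles everything else.

IntWritable and Text are Hadoop's wrappers for int and String. They can be serialized, which makes it easy to exchange data in a distributed environment.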
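The shape of the records that reach map can be reproduced without Hadoop. The sketch below is an illustration only: MapInputSketch, records, and mapLine are hypothetical names, the sample text is an assumption, and it assumes lines end with a single "\n" byte. It computes the byte offset of each line and then tokenizes a line the way the map method does.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch of the (offset, line) records passed to map and of the pairs map emits.
class MapInputSketch {

    // Returns one "offset<TAB>line" string per line of the file content.
    static List<String> records(String content) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            out.add(offset + "\t" + line);
            // +1 accounts for the line terminator (assumed to be a single byte).
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    // Mirrors the body of map: one <word,1> pair per whitespace-separated token.
    static List<String> mapLine(String line) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            pairs.add("<" + itr.nextToken() + ",1>");
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (String record : records("Hello World\nBye World\n")) {
            System.out.println(record);              // offsets 0 and 12
        }
        System.out.println(mapLine("Hello World")); // [<Hello,1>, <World,1>]
    }
}
```

The second line starts at offset 12 because "Hello World" is 11 bytes long and the line terminator takes one more.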

The TokenizerMapper implementation is as follows:

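// Imports required at the top of WordCount.java for this class:
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        System.out.println("key=" + key.toString());     // added to inspect the key
        System.out.println("value=" + value.toString()); // added to inspect the value
        // Split the line into words and emit <word,1> for each one.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}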

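(2) The Reduce phase

The Reduce phase requires extending the Reducer class from the org.apache.hadoop.mapreduce package and overriding its reduce method.

The key parameter of reduce is a single word, and values is the list of counts for that word coming from the Mappers, so iterating over values and summing them gives the total number of occurrences of the word.

The IntSumReducer implementation is as follows: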

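// Imports required at the top of WordCount.java for this class:
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Add up all the counts received for this word.
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}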
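The same summation can be tried out in plain Java, without Hadoop types. This is a sketch under assumptions rather than the framework's behaviour: ReduceSketch and reduceAll are hypothetical names, and the grouped input (each word with the list of counts it received) is made-up sample data standing in for what the framework hands to reduce.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the reduce step: sum the list of counts for each word.
class ReduceSketch {

    // Same logic as the reduce method: iterate over the values and add them up.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Applies reduce to every key; a TreeMap keeps the output sorted by word.
    static Map<String, Integer> reduceAll(Map<String, List<Integer>> grouped) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reduce(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample input: each word with the counts received from the mappers.
        Map<String, List<Integer>> grouped = Map.of(
                "Hello", List.of(1, 1),
                "Bye", List.of(1, 1),
                "World", List.of(2),
                "Hadoop", List.of(2));
        System.out.println(reduceAll(grouped));
        // prints {Bye=2, Hadoop=2, Hello=2, World=2}
    }
}
```

A value such as 2 in the input list is what a word looks like after a Combine step has already summed its counts on the map side.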

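(3) Running the MapReduce job

In MapReduce, a Job object manages and runs a computation, and the job's parameters are set through methods of Job.

Here TokenizerMapper is set to perform the Map phase, and IntSumReducer to perform both the Combine and the Reduce phases.

The output types of the Map and Reduce phases are also set: the key type is Text and the value type is IntWritable.

The input and output paths of the job are given as command-line arguments and are set through FileInputFormat and FileOutputFormat respectively.

Once the parameters are set, calling job.waitForCompletion() runs the job. The main function is implemented as follows: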

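// Imports required at the top of WordCount.java for main:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Strip the generic Hadoop options; what remains must be the input and output paths.
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
    }
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // the reducer doubles as the combiner
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // Exit with 0 on success and 1 on failure.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}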

The run produces the following output:

14/12/17 05:53:26 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
14/12/17 05:53:26 INFO input.FileInputFormat: Total input paths to process :
14/12/17 05:53:26 INFO mapred.JobClient: Running job: job_local_0001
14/12/17 05:53:26 INFO mapred.MapTask: io.sort.mb = 100
14/12/17 05:53:27 INFO mapred.MapTask: data buffer = 79691776/99614720
record buffer = 262144/327680
key=0
value=Hello World
key=12
value=Bye World
Starting flush of map output
Finished spill 0
14/12/17 05:53:27 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
14/12/17 05:53:27 INFO mapred.LocalJobRunner:
Task 'attempt_local_0001_m_000000_0' done.
value=Hello Hadoop
key=13
value=Bye Hadoop
attempt_local_0001_m_000001_0 is done. And is in the process of commiting
Task 'attempt_local_0001_m_000001_0' done.
14/12/17 05:53:27 INFO mapred.Merger: Merging 2 sorted segments
Down to the last merge-pass, with 2 segments left of total size: 73 bytes
attempt_local_0001_r_000000_0 is done. And is in the process of commiting
Task attempt_local_0001_r_000000_0 is allowed to commit now
14/12/17 05:53:27 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to out
reduce > reduce
Task 'attempt_local_0001_r_000000_0' done.
14/12/17 05:53:27 INFO mapred.JobClient: map 100% reduce 100%
Job complete:
Counters:
FileSystemCounters
FILE_BYTES_READ=17886
HDFS_BYTES_READ=52932
FILE_BYTES_WRITTEN=54239
HDFS_BYTES_WRITTEN=71431
Map-Reduce Framework
Reduce input groups=4
Combine output records=6
Map input records=4
Reduce shuffle bytes=0
Reduce output records=4
Spilled Records=12
Map output bytes=78
Combine input records=8
Map output records=8
Reduce input records=6

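2. How WordCount processes the data

The sections above gave the design and the source code of WordCount without going into detail. The following is a more detailed analysis:

(1) The files are divided into splits. Because the test files are small, each file forms one split, and each split is then broken line by line into <key,value> pairs, as shown in the figure. The MapReduce framework does this automatically. The offset includes the characters taken up by the line terminators.

(2) The <key,value> pairs are passed to the user-defined map method, which produces new <key,value> pairs.

(3) Once it has the <key,value> pairs emitted by map, the Mapper sorts them by key and runs the Combine step, which adds up the values that share a key, giving the final output of the Mapper, as shown in the figure:

(4) The Reducer first sorts the data received from the Mappers and then passes it to the user-defined reduce method, which produces new <key,value> pairs. These are the output of WordCount, as shown in the figure: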
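Steps (1) to (4) can be traced end to end with a small plain-Java simulation. It is a sketch, not the framework: WordCountPipeline and its methods are hypothetical names, the two input strings are an assumption reconstructed from the value lines printed in the run log, and a TreeMap replaces Hadoop's sort-then-combine by keeping the keys sorted while it accumulates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical simulation of split -> map -> sort/combine -> reduce for WordCount.
class WordCountPipeline {

    // Steps (2) and (3): tokenize every line of one split, then combine per key.
    // The TreeMap keeps the keys sorted, as the Mapper's output would be.
    static TreeMap<String, Integer> mapAndCombine(String split) {
        TreeMap<String, Integer> combined = new TreeMap<>();
        for (String line : split.split("\n")) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                combined.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return combined;
    }

    // Step (4): merge the sorted map outputs and sum the values of each key.
    static TreeMap<String, Integer> reduce(List<TreeMap<String, Integer>> mapOutputs) {
        TreeMap<String, Integer> totals = new TreeMap<>();
        for (TreeMap<String, Integer> output : mapOutputs) {
            output.forEach((word, count) -> totals.merge(word, count, Integer::sum));
        }
        return totals;
    }

    public static void main(String[] args) {
        // Step (1): one split per file (file contents assumed from the run log).
        List<String> splits = List.of("Hello World\nBye World\n", "Hello Hadoop\nBye Hadoop\n");
        List<TreeMap<String, Integer>> mapOutputs = new ArrayList<>();
        int combineOutputRecords = 0;
        for (String split : splits) {
            TreeMap<String, Integer> output = mapAndCombine(split);
            mapOutputs.add(output);
            combineOutputRecords += output.size();
        }
        System.out.println(mapOutputs);           // [{Bye=1, Hello=1, World=2}, {Bye=1, Hadoop=2, Hello=1}]
        System.out.println(combineOutputRecords); // 6
        System.out.println(reduce(mapOutputs));   // {Bye=2, Hadoop=2, Hello=2, World=2}
    }
}
```

With these inputs the simulation agrees with the counters in the log: the combiners emit 6 records in total ("Combine output records=6") and the reducer sees 4 distinct words ("Reduce input groups=4").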

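3. How well do you know MapReduce?

The MapReduce framework quietly does a lot of work behind the scenes. What happens if map and reduce are not overridden?

To find out, create a stripped-down MapReduce program, LazyMapReduce, that only performs the necessary job initialization and sets the input and output paths, leaving every other parameter at its default.

The code is as follows: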

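// Uses the same org.apache.hadoop imports as the WordCount main function.
public class LazyMapReduce {
    public static void main(String[] args) throws Exception {
        // TODO Auto-generated method stub
        Configuration conf = new Configuration();
        String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
        }
        // No mapper, combiner, reducer, or output types are set: the defaults apply.
        Job job = new Job(conf, "LazyMapReduce");
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}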

The run produces:

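14/12/17 23:04:13 INFO jvm.JvmMetrics:
14/12/17 23:04:14 INFO input.FileInputFormat:
14/12/17 23:04:14 INFO mapred.JobClient:
14/12/17 23:04:14 INFO mapred.MapTask:
14/12/17 23:04:15 INFO mapred.JobClient: map 0% reduce 0%
14/12/17 23:04:18 INFO mapred.MapTask:
14/12/17 23:04:19 INFO mapred.MapTask:
14/12/17 23:04:19 INFO mapred.TaskRunner:
14/12/17 23:04:19 INFO mapred.LocalJobRunner: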
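When no mapper or reducer is set, Hadoop falls back to the base Mapper and Reducer classes, whose map and reduce methods pass every record through unchanged, so the job writes each input line next to its byte offset, sorted by offset. The plain-Java sketch below imitates that behaviour. It is an illustration only: IdentityJobSketch and run are hypothetical names, the input strings are assumed to be the same two test files as before, and it assumes ASCII text with "\n" line endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a job that runs with identity map and reduce functions.
class IdentityJobSketch {

    static List<String> run(List<String> files) {
        // Shuffle: group the map output by key, with the keys in sorted order.
        TreeMap<Long, List<String>> grouped = new TreeMap<>();
        for (String content : files) {
            long offset = 0;
            for (String line : content.split("\n")) {
                // Identity map: emit the (offset, line) record unchanged.
                grouped.computeIfAbsent(offset, k -> new ArrayList<>()).add(line);
                offset += line.length() + 1; // assumes ASCII text and a one-byte terminator
            }
        }
        // Identity reduce: write every value together with its key.
        List<String> output = new ArrayList<>();
        for (Map.Entry<Long, List<String>> entry : grouped.entrySet()) {
            for (String value : entry.getValue()) {
                output.add(entry.getKey() + "\t" + value);
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> files = List.of("Hello World\nBye World\n", "Hello Hadoop\nBye Hadoop\n");
        for (String record : run(files)) {
            System.out.println(record);
        }
        // Prints, with a tab between offset and line:
        // 0   Hello World
        // 0   Hello Hadoop
        // 12  Bye World
        // 13  Bye Hadoop
    }
}
```

In a real job the relative order of values that share a key (here the two lines at offset 0) is not guaranteed; the sketch simply keeps them in input order.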
