大数据Spark Shuffle二ExecutorDriver之间Shuffle结果消息传递追踪.docx

资源描述

大数据Spark Shuffle二ExecutorDriver之间Shuffle结果消息传递追踪.docx

《大数据Spark Shuffle二ExecutorDriver之间Shuffle结果消息传递追踪.docx》由会员分享，可在线阅读，更多相关《大数据Spark Shuffle二ExecutorDriver之间Shuffle结果消息传递追踪.docx（15页珍藏版）》请在冰豆网上搜索。

大数据Spark Shuffle二ExecutorDriver之间Shuffle结果消息传递追踪.docx

大数据SparkShuffle二ExecutorDriver之间Shuffle结果消息传递追踪

大数据：

SparkShuffle

（二）Executor、Driver之间Shuffle结果消息传递、追踪

1.前言

输出Shuffle结果到Shuffle_shuffleId_mapId_0.data数据文件中，每个executor需要向Driver汇报当前节点的Shuffle结果状态，Driver保存结果信息进行下个Task的调度。

2.StatusUpdate消息

当Executor运行完Task的时候需要向Driver汇报StatusUpdate的消息

[plain]viewplaincopy

overridedefstatusUpdate（taskId:

Long,state:

TaskState,data:

ByteBuffer）{

valmsg=StatusUpdate（executorId,taskId,state,data）

drivermatch{

caseSome（driverRef）=>driverRef.send（msg）

caseNone=>logWarning（s"Drop$msgbecausehasnotyetconnectedtodriver"）

}

整个结构体中包含了

ExecutorId:

Executor自己的ID

TaskId:

task分配的ID

State:

Task的运行状态

[plain]viewplaincopy

LAUNCHING,RUNNING,FINISHED,FAILED,KILLED,LOST

Data:

保存序列化的Result

2.1Executor端发送

在Task运行后的结果，Executor会将结果首先序列化成ByteBuffer封装成DirectTaskResult，再次序列化DirectTaskResult成ByteBuffer，很显然序列化的结果的大小会决定不同的传递策略。

在这里会有两个筏值来控制

最大的返回结果大小，如果超过设定的最大返回结果时，返回的结果内容会被丢弃，只是返回序列化的InDirectTaskResult，里面包含着BlockID和序列化后的结果大小

[plain]viewplaincopy

spark.driver.maxResultSize

最大的直接返回结果大小：

如果返回的结果大于最大的直接返回结果大小，小于最大的返回结果大小，采用了保存的折中的策略，将序列化DirectTaskResult保存到BlockManager中，关于BlockManager可以参考前面写的BlockManager系列，返回InDirectTaskResult，里面包含着BlockID和序列化的结果大小

[plain]viewplaincopy

spark.task.maxDirectResultSize

直接返回：

如果返回的结果小于等于最大的直接返回结果大小，将直接将序列化的DirectTaskResult返回给Driver端

[plain]viewplaincopy

valserializedResult:

ByteBuffer={

if（maxResultSize>0&&resultSize>maxResultSize）{

logWarning（s"Finished$taskName（TID$taskId）.ResultislargerthanmaxResultSize"+

s"（${Utils.bytesToString（resultSize）}>${Utils.bytesToString（maxResultSize）}）,"+

s"droppingit."）

ser.serialize（newIndirectTaskResult[Any]（TaskResultBlockId（taskId）,resultSize））

}elseif（resultSize>maxDirectResultSize）{

valblockId=TaskResultBlockId（taskId）

env.blockManager.putBytes（

blockId,

newChunkedByteBuffer（serializedDirectResult.duplicate（））,

StorageLevel.MEMORY_AND_DISK_SER）

logInfo（

s"Finished$taskName（TID$taskId）.$resultSizebytesresultsentviaBlockManager）"）

ser.serialize（newIndirectTaskResult[Any]（blockId,resultSize））

}else{

logInfo（s"Finished$taskName（TID$taskId）.$resultSizebytesresultsenttodriver"）

serializedDirectResult

}

2.2Driver端接收

Driver端处理StatusUpdate的消息的代码如下：

[plain]viewplaincopy

caseStatusUpdate（executorId,taskId,state,data）=>

scheduler.statusUpdate（taskId,state,data.value）

if（TaskState.isFinished（state））{

executorDataMap.get（executorId）match{

caseSome（executorInfo）=>

executorInfo.freeCores+=scheduler.CPUS_PER_TASK

makeOffers（executorId）

caseNone=>

//Ignoringtheupdatesincewedon'tknowabouttheexecutor.

logWarning（s"Ignoredtaskstatusupdate（$taskIdstate$state）"+

s"fromunknownexecutorwithID$executorId"）

}

scheduler实例是TaskSchedulerImpl.scala

[plain]viewplaincopy

if（TaskState.isFinished（state））{

cleanupTaskState（tid）

taskSet.removeRunningTask（tid）

if（state==TaskState.FINISHED）{

taskResultGetter.enqueueSuccessfulTask（taskSet,tid,serializedData）

}elseif（Set（TaskState.FAILED,TaskState.KILLED,TaskState.LOST）.contains（state））{

taskResultGetter.enqueueFailedTask（taskSet,tid,state,serializedData）

}

statusUpdate函数调用了enqueueSuccessfulTask方法

[plain]viewplaincopy

defenqueueSuccessfulTask（

taskSetManager:

TaskSetManager,

tid:

Long,

serializedData:

ByteBuffer）:

Unit={

getTaskResultExecutor.execute（newRunnable{

overridedefrun（）:

Unit=Utils.logUncaughtExceptions{

try{

val（result,size）=serializer.get（）.deserialize[TaskResult[_]]（serializedData）match{

casedirectResult:

DirectTaskResult[_]=>

if（!

taskSetManager.canFetchMoreResults（serializedData.limit（）））{

return

}

//deserialize"value"withoutholdinganylocksothatitwon'tblockotherthreads.

//Weshouldcallithere,sothatwhenit'scalledagainin

//"TaskSetManager.handleSuccessfulTask",itdoesnotneedtodeserializethevalue.

directResult.value（taskResultSerializer.get（））

（directResult,serializedData.limit（））

caseIndirectTaskResult（blockId,size）=>

if（!

taskSetManager.canFetchMoreResults（size））{

//droppedbyexecutorifsizeislargerthanmaxResultSize

sparkEnv.blockManager.master.removeBlock（blockId）

return

}

logDebug（"FetchingindirecttaskresultforTID%s".format（tid））

scheduler.handleTaskGettingResult（taskSetManager,tid）

valserializedTaskResult=sparkEnv.blockManager.getRemoteBytes（blockId）

if（!

serializedTaskResult.isDefined）{

/*Wewon'tbeabletogetthetaskresultifthemachinethatranthetaskfailed

*betweenwhenthetaskendedandwhenwetriedtofetchtheresult,orifthe

*blockmanagerhadtoflushtheresult.*/

scheduler.handleFailedTask（

taskSetManager,tid,TaskState.FINISHED,TaskResultLost）

return

}

valdeserializedResult=serializer.get（）.deserialize[DirectTaskResult[_]]（

serializedTaskResult.get.toByteBuffer）

//forcedeserializationofreferencedvalue

deserializedResult.value（taskResultSerializer.get（））

sparkEnv.blockManager.master.removeBlock（blockId）

（deserializedResult,size）

}

//Setthetaskresultsizeintheaccumulatorupdatesreceivedfromtheexecutors.

//Weneedtodothishereonthedriverbecauseifwedidthisontheexecutorsthen

//wewouldhavetoserializetheresultagainafterupdatingthesize.

result.accumUpdates=result.accumUpdates.map{a=>

if（a.name==Some（InternalAccumulator.RESULT_SIZE））{

valacc=a.asInstanceOf[LongAccumulator]

assert（acc.sum==0L,"taskresultsizeshouldnothavebeensetontheexecutors"）

acc.setValue（size.toLong）

acc

}else{

}

scheduler.handleSuccessfulTask（taskSetManager,tid,result）

}catch{

casecnf:

ClassNotFoundException=>

valloader=Thread.currentThread.getContextClassLoader

taskSetManager.abort（"ClassNotFoundwithclassloader:

"+loader）

//MatchingNonFatalsowedon'tcatchtheControlThrowablefromthe"return"above.

caseNonFatal（ex）=>

logError（"Exceptionwhilegettingtaskresult",ex）

taskSetManager.abort（"Exceptionwhilegettingtaskresult:

%s".format（ex））

}

}）

}

在函数中，反序列化的过程是通过线程池里的线程来运行的，Netty的接收数据线程是不能被堵塞（同时还接受着别的消息），反序列化是耗时的任务，不能在Netty的消息处理线程中运行。

2.2.1DirectTaskResult处理过程

直接反序列化成DirectTaskResult，反序列化后进行了整体返回内容的大小的判断，在前面的2.1中介绍参数：

spark.driver.maxResultSize，这个参数是Driver端的参数控制的，在Spark中会启动多个Task，参数的控制是一个整体的控制所有的Tasks的返回结果的数量大小，当然单个task使用该筏值的控制也是没有问题，因为只要有一个任务返回的结果超过maxResultSize，整体返回的数据也会超过maxResultSize。

对DirectTaskResult里的result进行了反序列化。

2.2.2InDirectTaskResult处理过程

通过size判断大小是否超过spark.driver.maxResultSize筏值控制

通过BlockManager来获取BlockID的内容反序列化成DirectTaskResult

对DirectTaskResult里的result进行了反序列化

最后调用handleSuccessfulTask方法

[plain]viewplaincopy

sched.dagScheduler.taskEnded（tasks（index）,Success,result.value（）,result.accumUpdates,info）

回到了Dag的调度，向eventProcessLoop的队列里提交了CompletionEvent的事件

[plain]viewplaincopy

deftaskEnded（

task:

Task[_],

reason:

TaskEndReason,

result:

Any,

accumUpdates:

Seq[AccumulatorV2[_,_]],

taskInfo:

TaskInfo）:

Unit={

eventProcessLoop.post（

CompletionEvent（task,reason,result,accumUpdates,taskInfo））

}

处理eventProcessLoop队列的event是在DAG的线程处理的，在这里我们不讨论DAG的任务调度。

2.3MapOutputTracker

MapOutputTracker是当运行完ShuffleMapTask的时候，ShuffleWrite会生成Shuffle_shuffleId_mapId_0.data、index文件，Executor需要将具体的信息返回给Driver，当Driver进行下一步的Task运算的时候，Executor也需要获取具体Shuffle数据文件的信息进行下一步的action算子的运算，结构的保存、管理就是通过MapOutputTracker跟踪器进行追踪的。

2.3.1RegisterMapOutput

Execute端

在ShuffleMapTask中运行后会生成一个MapStatus，也就是上图的Map0结构，ComressedMapStatus、HighlyCompressedMapStatus这里的两个区别主要是增对Partition1...的sizelong的压缩，但这里的压缩算法并不准确比，如CompressedMapStatus的算法：

[plain]viewplaincopy

defcompressSize（size:

Long）:

Byte={

if（size==0）{

}elseif（size<=1L）{

}else{

math.min（255,math.ceil（math.log（size）/math.log（LOG_BASE））.toInt）.toByte

}

求Log1.1（size）的整数转为byte，也就是支持最大1.1^255=35G左右

为何不需要计算精准的尺寸？

还记得前面博客里提到的Shuffle_shuffleId_mapId_reduceId.index文件么，这里才是精准的位置，当读取本地文件的时候，并不使用MapStatus里的Size

Size有何用？

有存在别的Execute获取别的Execute的Shuffle结果文件，此时的size是获取文件的大概位置。

MapStatus是ShuffleMapTask运行的结果，被序列化成DirectTaskResult中的value，通过StatusUpdate消息传递

Driver端

DAG线程调度处理CompletionEvent的事件

[plain]viewplaincopy

private[scheduler]defhandleTaskCompletion（event:

CompletionEvent）{

............

casesmt:

ShuffleMapTask=>

valshuffleStage=stage.asInstanceOf[ShuffleMapStage]

updateAccumulators（event）

valstatus=event.result.asInstanceOf[MapStatus]

valexecId=atus.location.executorId

logDebug（"ShuffleMapTaskfinishedon"+execId）

if（failedEpoch.contains（execId）&&smt.epoch<=failedEpoch（execId））{

logInfo（s"Ignoringpossiblybogus$smtcompletionfromexecutor$execId"）

}else{

shuffleStage.addOutputLoc（smt.partitionId,status）

}

if（runningStages.contains（shuffleStage）&&shuffleStage.pendingPartitions.isEmpty）{

markStageAsFinished（shuffleStage）

logInfo（"lookingfornewlyrunnablestages"）

logInfo（"running:

"+runningStages）

logInfo（"waiting:

"+waitingStages）

logInfo（"failed:

"+failedStages）

//Wesupplytruetoincrementtheepochnumberhereincasethisisa

//recomputationofthemapoutputs.Inthatcase,somenodesmayhavecached

//locationswithholes（fromwhenwedetectedtheerror）andwillneedthe

//epochincrementedtorefetchthem.

//TODO:

Onlyincrementtheepochnumberifthisisnotthefirsttime

//weregisteredthesemapoutputs.

mapOutputTracker.registerMapOutputs（

shuffleStage.shuffleDep.shuffleId,

shuffleStage.outputLocInMapOutputTrackerFormat（）,

changeEpoch=true）

clearCacheLocs（）

if（!

shuffleStage.isAvailable）{

//Som

展开阅读全文