
Spark Integration with Kafka 0.10.0 New Features (Part 2)

Continuing from Spark Integration with Kafka 0.10.0 New Features (Part 1), we start from the canonical usage example:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))

Having analyzed the location strategies and consumer strategies, let's now look at the concrete implementation of org.apache.spark.streaming.kafka010.KafkaUtils#createDirectStream:


@Experimental
def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V]): InputDStream[ConsumerRecord[K, V]] = {
  val ppc = new DefaultPerPartitionConfig(ssc.sparkContext.getConf)
  createDirectStream[K, V](ssc, locationStrategy, consumerStrategy, ppc)
}

The return type is InputDStream[ConsumerRecord[K, V]]. Let's take a look at the ConsumerRecord type:


/**
 * A key/value pair to be received from Kafka. This consists of a topic name and a partition number, from which the
 * record is being received and an offset that points to the record in a Kafka partition.
 */
public final class ConsumerRecord<K, V> {
    public static final long NO_TIMESTAMP = Record.NO_TIMESTAMP;
    public static final int NULL_SIZE = -1;
    public static final int NULL_CHECKSUM = -1;

    private final String topic;
    private final int partition;
    private final long offset;
    private final long timestamp;
    private final TimestampType timestampType;
    private final long checksum;
    private final int serializedKeySize;
    private final int serializedValueSize;
    private final K key;
    private final V value;

    // ... remainder omitted
}
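Each element of the direct stream is one of these records, so its fields can be read directly. A small sketch, reusing the stream defined earlier (the println output format is just for illustration; the output appears in the executor logs):

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // topic, partition and offset identify exactly where this message came from
    println(s"${record.topic}-${record.partition}@${record.offset}: " +
      s"key=${record.key}, value=${record.value}")
  }
}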

The details of InputDStream itself are skipped here; what matters is the class hierarchy: DirectKafkaInputDStream extends InputDStream[ConsumerRecord[K, V]], which in turn extends DStream. So the concrete type returned by createDirectStream is DirectKafkaInputDStream.

Inside createDirectStream a DefaultPerPartitionConfig is created. DefaultPerPartitionConfig simply supplies the maximum rate at which messages are fetched from each partition, configured via spark.streaming.kafka.maxRatePerPartition. Its source is as follows:


package org.apache.spark.streaming.kafka010

import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.annotation.Experimental

/**
 * ::Experimental::
 * Interface for user-supplied configurations that can't otherwise be set via Spark properties,
 * because they need tweaking on a per-partition basis.
 * (In other words: a hook for configuration that has to vary per partition; the default rate
 * itself is still read from SparkConf.)
 */
@Experimental
abstract class PerPartitionConfig extends Serializable {
  /**
   * Maximum rate (number of records per second) at which data will be read
   * from each Kafka partition.
   */
  def maxRatePerPartition(topicPartition: TopicPartition): Long
}

/**
 * Default per-partition configuration
 */
private class DefaultPerPartitionConfig(conf: SparkConf)
  extends PerPartitionConfig {
  val maxRate = conf.getLong("spark.streaming.kafka.maxRatePerPartition", 0)

  // Maximum rate (records per second) read from each Kafka partition; the default 0 imposes no limit
  def maxRatePerPartition(topicPartition: TopicPartition): Long = maxRate
}
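Since DefaultPerPartitionConfig just reads spark.streaming.kafka.maxRatePerPartition from the SparkConf, the limit is set like any other Spark property. A minimal sketch (the value 1000 and the batch interval are only examples):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka010-direct-stream")
  // at most 1000 records per second will be read from each Kafka partition
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val streamingContext = new StreamingContext(conf, Seconds(5))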

Once the PerPartitionConfig has been created, the overloaded createDirectStream is called:


def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    perPartitionConfig: PerPartitionConfig
  ): InputDStream[ConsumerRecord[K, V]] = {
  new DirectKafkaInputDStream[K, V](ssc, locationStrategy, consumerStrategy, perPartitionConfig)
}
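This four-argument overload is also public, so a caller can pass in its own PerPartitionConfig to throttle individual partitions differently. A hedged sketch of what that might look like, reusing streamingContext, topics and kafkaParams from the first example; the class name TopicAwareConfig, the topic "hot_topic" and the rates are made up for illustration:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Give the busy topic a higher per-partition ceiling than everything else.
class TopicAwareConfig extends PerPartitionConfig {
  override def maxRatePerPartition(tp: TopicPartition): Long =
    if (tp.topic == "hot_topic") 5000L else 500L
}

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams),
  new TopicAwareConfig)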

Next, let's focus on the constructor of DirectKafkaInputDStream. Note that in Scala the whole class body, from the opening { to the closing }, belongs to the primary constructor, so every statement in the listing below runs when the DStream is instantiated.
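As a quick, stand-alone illustration of that rule (a made-up class, not Spark code):

class Greeter(name: String) {
  // Both of these lines are part of the primary constructor and run on `new Greeter(...)`.
  val greeting = s"hello, $name"
  println(greeting)
}

new Greeter("kafka")   // prints "hello, kafka" immediately

With that in mind, here is the DirectKafkaInputDStream source: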


package org.apache.spark.streaming.kafka010

import java.{util => ju}
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicReference

import scala.annotation.tailrec
import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.{PartitionInfo, TopicPartition}

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.scheduler.{RateController, StreamInputInfo}
import org.apache.spark.streaming.scheduler.rate.RateEstimator

/**
 * A DStream where
 * each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 * of messages
 * per second that each '''partition''' will accept.
 *
 * (In other words, every partition of every subscribed topic maps to one RDD partition, and
 * spark.streaming.kafka.maxRatePerPartition caps how many messages per second are pulled from
 * each of those partitions.)
 *
 * @param locationStrategy In most cases, pass in [[PreferConsistent]],
 *   see [[LocationStrategy]] for more details.
 * @param executorKafkaParams Kafka
 *   <a href="http://kafka.apache.org/documentation.html#newconsumerconfigs">
 *   configuration parameters</a>.
 *   Requires "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param consumerStrategy In most cases, pass in [[Subscribe]],
 *   see [[ConsumerStrategy]] for more details
 * @tparam K type of Kafka message key
 * @tparam V type of Kafka message value
 */

private[spark] class DirectKafkaInputDStream[K, V](
    _ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    ppc: PerPartitionConfig
  ) extends InputDStream[ConsumerRecord[K, V]](_ssc) with Logging with CanCommitOffsets {

  val executorKafkaParams = {
    val ekp = new ju.HashMap[String, Object](consumerStrategy.executorKafkaParams)
    // Adjust the Kafka parameters for use on the executors, to avoid problems there
    // (e.g. executors must not auto-commit or reuse the driver's group.id).
    KafkaUtils.fixKafkaParams(ekp)
    ekp
  }

  // Holds the current offset for every TopicPartition this stream is reading
  protected var currentOffsets = Map[TopicPartition, Long]()

  // The driver-side Kafka consumer, created lazily on first use
  @transient private var kc: Consumer[K, V] = null

  def consumer(): Consumer[K, V] = this.synchronized {
    if (null == kc) {
      kc = consumerStrategy.onStart(currentOffsets.mapValues(l => new java.lang.Long(l)).asJava)
    }
    kc
  }

  override def persist(newLevel: StorageLevel): DStream[ConsumerRecord[K, V]] = {
    logError("Kafka ConsumerRecord is not serializable. " +
      "Use .map to extract fields before calling .persist or .window")
    super.persist(newLevel)
  }

  // Builds a map from each assigned TopicPartition to the host of its leader broker
  protected def getBrokers = {
    val c = consumer
    val result = new ju.HashMap[TopicPartition, String]()
    val hosts = new ju.HashMap[TopicPartition, String]()
    // assignment() returns the set of TopicPartitions currently assigned to this consumer
    val assignments = c.assignment().iterator()
    // Two nested while loops: the outer walks the assignments, the inner the partition metadata
    while (assignments.hasNext()) {
      val tp: TopicPartition = assignments.next()
      // If the leader host for this TopicPartition is not cached yet, look it up from the Kafka cluster
      if (null == hosts.get(tp)) {
        // partitionsFor fetches the metadata for the given topic, issuing an RPC if it is not available locally
        val infos = c.partitionsFor(tp.topic).iterator()
        while (infos.hasNext()) {
          val i = infos.next()
          // TopicPartition overrides equals (and hashCode), so it can be used as a map key
          hosts.put(new TopicPartition(i.topic(), i.partition()), i.leader.host())
        }
      }
      // Because TopicPartition overrides equals, hosts.get(tp) finds the entry cached above;
      // at this point we have the partition and its leader's address
      result.put(tp, hosts.get(tp))
    }
    result
  }

  // Translate the chosen LocationStrategy into a preferred-host map for the Kafka RDD
  protected def getPreferredHosts: ju.Map[TopicPartition, String] = {
    locationStrategy match {
      // PreferBrokers: schedule each partition on the host of its Kafka leader
      case PreferBrokers => getBrokers
      // PreferConsistent: no preference, distribute partitions evenly across executors
      case PreferConsistent => ju.Collections.emptyMap[TopicPartition, String]()
      // PreferFixed: use the user-supplied partition-to-host map
      case PreferFixed(hostMap) => hostMap
    }
  }

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka 0.10 direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    // A rate controller is only created when back pressure is turned on
    // (spark.streaming.backpressure.enabled = true)
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected[streaming] def maxMessagesPerPartition(
      offsets: Map[TopicPartition, Long]): Option[Map[TopicPartition, Long]] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate())

    // calculate a per-partition rate limit based on current lag
    val effectiveRateLimitPerPartition = estimatedRateLimit.filter(_ > 0) match {
      case Some(rate) =>
        // Lag of each partition = latest available offset minus the offset consumed so far
        val lagPerPartition = offsets.map { case (tp, offset) =>
          tp -> Math.max(offset - currentOffsets(tp), 0)
        }
        val totalLag = lagPerPartition.values.sum

        lagPerPartition.map { case (tp, lag) =>
          val maxRateLimitPerPartition = ppc.maxRatePerPartition(tp)
          // Share of the estimated total rate, proportional to this partition's lag
          val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
          tp -> (if (maxRateLimitPerPartition > 0) {
            Math.min(backpressureRate, maxRateLimitPerPartition)
          } else backpressureRate)
        }
      case None =>
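The listing is cut off at this point in the original document. As a hedged sketch only (paraphrased from the Spark 2.x source rather than from the text above): when no backpressure rate estimate is available, the None branch falls back to ppc.maxRatePerPartition for every partition, and the chosen per-second limits are then converted into a per-batch cap on message counts, roughly like this:

      // Sketch of the truncated remainder (based on the Spark 2.x source, not the text above)
      case None => offsets.map { case (tp, offset) => tp -> ppc.maxRatePerPartition(tp) }
    }

    if (effectiveRateLimitPerPartition.values.sum > 0) {
      // Convert records-per-second limits into records-per-batch limits
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some(effectiveRateLimitPerPartition.map {
        case (tp, limit) => tp -> (secsPerBatch * limit).toLong
      })
    } else {
      None
    }
  }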
