
Spark Integration with Kafka 0.10.0 New Features (Part 2)

Continuing from Spark Integration with Kafka 0.10.0 New Features (Part 1), we start from the canonical usage example:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092,anotherhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

stream.map(record => (record.key, record.value))

Having analyzed the location strategies and consumer strategies, let's now look at the concrete implementation of org.apache.spark.streaming.kafka010.KafkaUtils#createDirectStream:


@Experimental
def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V]): InputDStream[ConsumerRecord[K, V]] = {
  val ppc = new DefaultPerPartitionConfig(ssc.sparkContext.getConf)
  createDirectStream[K, V](ssc, locationStrategy, consumerStrategy, ppc)
}

The return type is InputDStream[ConsumerRecord[K, V]]. Let's take a look at the ConsumerRecord type:


/**
 * A key/value pair to be received from Kafka. This consists of a topic name and a partition number, from which the
 * record is being received and an offset that points to the record in a Kafka partition.
 */
public final class ConsumerRecord<K, V> {
    public static final long NO_TIMESTAMP = Record.NO_TIMESTAMP;
    public static final int NULL_SIZE = -1;
    public static final int NULL_CHECKSUM = -1;

    private final String topic;
    private final int partition;
    private final long offset;
    private final long timestamp;
    private final TimestampType timestampType;
    private final long checksum;
    private final int serializedKeySize;
    private final int serializedValueSize;
    private final K key;
    private final V value;

    // ... remainder omitted
}
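Each element of the direct stream is one of these records, so its fields can be read directly. A small sketch, reusing the stream defined earlier (the println output format is just for illustration; the output appears in the executor logs):

stream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // topic, partition and offset identify exactly where this message came from
    println(s"${record.topic}-${record.partition}@${record.offset}: " +
      s"key=${record.key}, value=${record.value}")
  }
}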

The details of InputDStream itself are skipped here; what matters is the class hierarchy: DirectKafkaInputDStream extends InputDStream[ConsumerRecord[K, V]], which in turn extends DStream. So the concrete type returned by createDirectStream is DirectKafkaInputDStream.

Inside createDirectStream a DefaultPerPartitionConfig is created. DefaultPerPartitionConfig simply supplies the maximum rate at which messages are fetched from each partition, configured via spark.streaming.kafka.maxRatePerPartition. Its source is as follows:


package org.apache.spark.streaming.kafka010

import org.apache.kafka.common.TopicPartition
import org.apache.spark.SparkConf
import org.apache.spark.annotation.Experimental

/**
 * ::Experimental::
 * Interface for user-supplied configurations that can't otherwise be set via Spark properties,
 * because they need tweaking on a per-partition basis.
 * (In other words: a hook for configuration that has to vary per partition; the default rate
 * itself is still read from SparkConf.)
 */
@Experimental
abstract class PerPartitionConfig extends Serializable {
  /**
   * Maximum rate (number of records per second) at which data will be read
   * from each Kafka partition.
   */
  def maxRatePerPartition(topicPartition: TopicPartition): Long
}

/**
 * Default per-partition configuration
 */
private class DefaultPerPartitionConfig(conf: SparkConf)
  extends PerPartitionConfig {
  val maxRate = conf.getLong("spark.streaming.kafka.maxRatePerPartition", 0)

  // Maximum rate (records per second) read from each Kafka partition; the default 0 imposes no limit
  def maxRatePerPartition(topicPartition: TopicPartition): Long = maxRate
}
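Since DefaultPerPartitionConfig just reads spark.streaming.kafka.maxRatePerPartition from the SparkConf, the limit is set like any other Spark property. A minimal sketch (the value 1000 and the batch interval are only examples):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("kafka010-direct-stream")
  // at most 1000 records per second will be read from each Kafka partition
  .set("spark.streaming.kafka.maxRatePerPartition", "1000")

val streamingContext = new StreamingContext(conf, Seconds(5))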

Once the PerPartitionConfig has been created, the overloaded createDirectStream is called:


def createDirectStream[K, V](
    ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    perPartitionConfig: PerPartitionConfig
  ): InputDStream[ConsumerRecord[K, V]] = {
  new DirectKafkaInputDStream[K, V](ssc, locationStrategy, consumerStrategy, perPartitionConfig)
}
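This four-argument overload is also public, so a caller can pass in its own PerPartitionConfig to throttle individual partitions differently. A hedged sketch of what that might look like, reusing streamingContext, topics and kafkaParams from the first example; the class name TopicAwareConfig, the topic "hot_topic" and the rates are made up for illustration:

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Give the busy topic a higher per-partition ceiling than everything else.
class TopicAwareConfig extends PerPartitionConfig {
  override def maxRatePerPartition(tp: TopicPartition): Long =
    if (tp.topic == "hot_topic") 5000L else 500L
}

val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams),
  new TopicAwareConfig)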

Next, let's focus on the constructor of DirectKafkaInputDStream. Note that in Scala the whole class body, from the opening { to the closing }, belongs to the primary constructor, so every statement in the listing below runs when the DStream is instantiated.
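As a quick, stand-alone illustration of that rule (a made-up class, not Spark code):

class Greeter(name: String) {
  // Both of these lines are part of the primary constructor and run on `new Greeter(...)`.
  val greeting = s"hello, $name"
  println(greeting)
}

new Greeter("kafka")   // prints "hello, kafka" immediately

With that in mind, here is the DirectKafkaInputDStream source: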


package org.apache.spark.streaming.kafka010

import java.{util => ju}
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicReference

import scala.annotation.tailrec
import scala.collection.JavaConverters._
import scala.collection.mutable

import org.apache.kafka.clients.consumer._
import org.apache.kafka.common.{PartitionInfo, TopicPartition}

import org.apache.spark.SparkException
import org.apache.spark.internal.Logging
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{StreamingContext, Time}
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.scheduler.{RateController, StreamInputInfo}
import org.apache.spark.streaming.scheduler.rate.RateEstimator

/**
 * A DStream where
 * each given Kafka topic/partition corresponds to an RDD partition.
 * The spark configuration spark.streaming.kafka.maxRatePerPartition gives the maximum number
 * of messages
 * per second that each '''partition''' will accept.
 *
 * (In other words, every partition of every subscribed topic maps to one RDD partition, and
 * spark.streaming.kafka.maxRatePerPartition caps how many messages per second are pulled from
 * each of those partitions.)
 *
 * @param locationStrategy In most cases, pass in [[PreferConsistent]],
 *   see [[LocationStrategy]] for more details.
 * @param executorKafkaParams Kafka
 *   <a href="http://kafka.apache.org/documentation.html#newconsumerconfigs">
 *   configuration parameters</a>.
 *   Requires "bootstrap.servers" to be set with Kafka broker(s),
 *   NOT zookeeper servers, specified in host1:port1,host2:port2 form.
 * @param consumerStrategy In most cases, pass in [[Subscribe]],
 *   see [[ConsumerStrategy]] for more details
 * @tparam K type of Kafka message key
 * @tparam V type of Kafka message value
 */

private[spark] class DirectKafkaInputDStream[K, V](
    _ssc: StreamingContext,
    locationStrategy: LocationStrategy,
    consumerStrategy: ConsumerStrategy[K, V],
    ppc: PerPartitionConfig
  ) extends InputDStream[ConsumerRecord[K, V]](_ssc) with Logging with CanCommitOffsets {

  val executorKafkaParams = {
    val ekp = new ju.HashMap[String, Object](consumerStrategy.executorKafkaParams)
    // Adjust the Kafka parameters for use on the executors, to avoid problems there
    // (e.g. executors must not auto-commit or reuse the driver's group.id).
    KafkaUtils.fixKafkaParams(ekp)
    ekp
  }

  // Holds the current offset for every TopicPartition this stream is reading
  protected var currentOffsets = Map[TopicPartition, Long]()

  // The driver-side Kafka consumer, created lazily on first use
  @transient private var kc: Consumer[K, V] = null

  def consumer(): Consumer[K, V] = this.synchronized {
    if (null == kc) {
      kc = consumerStrategy.onStart(currentOffsets.mapValues(l => new java.lang.Long(l)).asJava)
    }
    kc
  }

  override def persist(newLevel: StorageLevel): DStream[ConsumerRecord[K, V]] = {
    logError("Kafka ConsumerRecord is not serializable. " +
      "Use .map to extract fields before calling .persist or .window")
    super.persist(newLevel)
  }

  // Builds a map from each assigned TopicPartition to the host of its leader broker
  protected def getBrokers = {
    val c = consumer
    val result = new ju.HashMap[TopicPartition, String]()
    val hosts = new ju.HashMap[TopicPartition, String]()
    // assignment() returns the set of TopicPartitions currently assigned to this consumer
    val assignments = c.assignment().iterator()
    // Two nested while loops: the outer walks the assignments, the inner the partition metadata
    while (assignments.hasNext()) {
      val tp: TopicPartition = assignments.next()
      // If the leader host for this TopicPartition is not cached yet, look it up from the Kafka cluster
      if (null == hosts.get(tp)) {
        // partitionsFor fetches the metadata for the given topic, issuing an RPC if it is not available locally
        val infos = c.partitionsFor(tp.topic).iterator()
        while (infos.hasNext()) {
          val i = infos.next()
          // TopicPartition overrides equals (and hashCode), so it can be used as a map key
          hosts.put(new TopicPartition(i.topic(), i.partition()), i.leader.host())
        }
      }
      // Because TopicPartition overrides equals, hosts.get(tp) finds the entry cached above;
      // at this point we have the partition and its leader's address
      result.put(tp, hosts.get(tp))
    }
    result
  }

  // Translate the chosen LocationStrategy into a preferred-host map for the Kafka RDD
  protected def getPreferredHosts: ju.Map[TopicPartition, String] = {
    locationStrategy match {
      // PreferBrokers: schedule each partition on the host of its Kafka leader
      case PreferBrokers => getBrokers
      // PreferConsistent: no preference, distribute partitions evenly across executors
      case PreferConsistent => ju.Collections.emptyMap[TopicPartition, String]()
      // PreferFixed: use the user-supplied partition-to-host map
      case PreferFixed(hostMap) => hostMap
    }
  }

  // Keep this consistent with how other streams are named (e.g. "Flume polling stream [2]")
  private[streaming] override def name: String = s"Kafka 0.10 direct stream [$id]"

  protected[streaming] override val checkpointData =
    new DirectKafkaInputDStreamCheckpointData

  /**
   * Asynchronously maintains & sends new rate limits to the receiver through the receiver tracker.
   */
  override protected[streaming] val rateController: Option[RateController] = {
    // A rate controller is only created when back pressure is turned on
    // (spark.streaming.backpressure.enabled = true)
    if (RateController.isBackPressureEnabled(ssc.conf)) {
      Some(new DirectKafkaRateController(id,
        RateEstimator.create(ssc.conf, context.graph.batchDuration)))
    } else {
      None
    }
  }

  protected[streaming] def maxMessagesPerPartition(
      offsets: Map[TopicPartition, Long]): Option[Map[TopicPartition, Long]] = {
    val estimatedRateLimit = rateController.map(_.getLatestRate())

    // calculate a per-partition rate limit based on current lag
    val effectiveRateLimitPerPartition = estimatedRateLimit.filter(_ > 0) match {
      case Some(rate) =>
        // Lag of each partition = latest available offset minus the offset consumed so far
        val lagPerPartition = offsets.map { case (tp, offset) =>
          tp -> Math.max(offset - currentOffsets(tp), 0)
        }
        val totalLag = lagPerPartition.values.sum

        lagPerPartition.map { case (tp, lag) =>
          val maxRateLimitPerPartition = ppc.maxRatePerPartition(tp)
          // Share of the estimated total rate, proportional to this partition's lag
          val backpressureRate = Math.round(lag / totalLag.toFloat * rate)
          tp -> (if (maxRateLimitPerPartition > 0) {
            Math.min(backpressureRate, maxRateLimitPerPartition)
          } else backpressureRate)
        }
      case None =>
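The listing is cut off at this point in the original document. As a hedged sketch only (paraphrased from the Spark 2.x source rather than from the text above): when no backpressure rate estimate is available, the None branch falls back to ppc.maxRatePerPartition for every partition, and the chosen per-second limits are then converted into a per-batch cap on message counts, roughly like this:

      // Sketch of the truncated remainder (based on the Spark 2.x source, not the text above)
      case None => offsets.map { case (tp, offset) => tp -> ppc.maxRatePerPartition(tp) }
    }

    if (effectiveRateLimitPerPartition.values.sum > 0) {
      // Convert records-per-second limits into records-per-batch limits
      val secsPerBatch = context.graph.batchDuration.milliseconds.toDouble / 1000
      Some(effectiveRateLimitPerPartition.map {
        case (tp, limit) => tp -> (secsPerBatch * limit).toLong
      })
    } else {
      None
    }
  }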
