Reputation: 5397
I am using Kafka 2.3.0 with Spark 2.2.1 and Scala 2.11. I am using the Direct Stream approach, in which the driver queries the latest offsets and decides the offset ranges for the streaming batch; the executors then read the data from Kafka using those offset ranges. As you can see below, I have a topic named test-kafka with 4 partitions, distributed across the two leaders.
Now, I start Spark Streaming, which reads data from the same topic using the following configuration:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localnosql1:9092,localnosql2:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "streaming-group",
  "auto.offset.reset" -> "earliest",
  "auto.commit.interval.ms" -> "1000",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val topics = Array("test-kafka")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)
So, now when I check the consumer group info on the CLI, it shows that only one consumer-id has been assigned. Does this mean only one consumer is consuming the data from Kafka? If yes, why is that? I have two executors running on the same machine where Kafka is running, as mentioned below.
Upvotes: 3
Views: 427
Reputation: 3433
It could be the case that you are committing offsets from the driver (which is valid). The driver fetches the offsets from Kafka and hands the offset ranges over to the executors, where KafkaRDD pulls the actual data from Kafka. After processing the batch, the driver commits the offsets back to Kafka, so only the driver's consumer shows up in the group.
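The driver-side commit pattern looks roughly like this, using the `CanCommitOffsets` API documented for spark-streaming-kafka-0-10 (a sketch continuing the question's `stream`, and assuming `enable.auto.commit` is set to `false` so commits are manual):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the offset ranges the driver computed for this batch.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process rdd on the executors ...

  // Commit from the driver only after the batch has been processed,
  // so offsets are not marked consumed before the work succeeds.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```

Because the commit happens on the driver, the consumer group only ever sees that single consumer, even though multiple executors are reading data.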
I had the same question; here is the link: Spark Streaming Direct Kafka Consumers are not evenly distributed across executors
Upvotes: 2