John Humphreys

Reputation: 39354

Kafka + Spark Streaming - Fairness between partitions?

I have a 20-partition topic in Kafka and am reading it with Spark Streaming (8 executors, 3 cores each). I'm using the direct stream method of reading.

I'm having problems because the first 12 partitions are getting read at a faster rate than the last 8 for some reason. So, data in the last 8 is getting stale (well, staler).

Partitions 12-19 are around 90% caught up to partitions 0-11, but we're talking about billions of messages, so the staleness of the data that is 10% back in the topic partition is pretty significant.

Is this normal? Can I make sure Kafka consumes the partitions more fairly?

Upvotes: 1

Views: 1372

Answers (2)

John Humphreys

Reputation: 39354

In my particular case, it turns out that I'm hitting some sort of bug (possibly in MapR's distribution).

The bug causes the offsets of certain partitions to reset to 0, which, when observed later, just makes them look incrementally a little behind.

I found configuration parameters which mitigate the issue, and a much larger discussion on the topic is available here: https://community.mapr.com/thread/22319-spark-streaming-mapr-streams-losing-partitions

Configuration Example - On Spark Context

 .set("spark.streaming.kafka.consumer.poll.ms", String.valueOf(Config.config.AGG_KAFKA_POLLING_MS))
 .set("spark.streaming.kafka.maxRetries", String.valueOf(10))

Edit

Confirmed that other people have had this issue as well with Spark Streaming + MapR-Streams/Kafka. This configuration seemed to lessen the chance of it happening, but it did eventually come back.

You can work around it with a safety check that detects the condition and "fixes" the offsets using a standard Kafka consumer before starting your Spark stream (the problem occurs when restarting the streaming app), but you have to store the offsets externally to do this. Compounding the problem, you can't reliably provide offsets to Spark 2.1.0 Streaming on start-up due to another bug, which is why you must manipulate the offsets with a consumer before starting the streaming; that way it starts from offsets already stored in Kafka.
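As a rough sketch of that workaround (not the exact code from the app): run something like this before starting the streaming context to commit the externally stored offsets back into Kafka with a plain consumer. The loadStoredOffsets function and the connection settings here are placeholders.

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

def repairOffsets(brokers: String, groupId: String, topic: String,
                  loadStoredOffsets: () => Map[Int, Long]): Unit = {
  val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("group.id", groupId)  // must match the streaming app's group id
  props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
  props.put("enable.auto.commit", "false")

  val consumer = new KafkaConsumer[String, String](props)
  try {
    // partition -> last known good offset, loaded from external storage (placeholder)
    val offsets = loadStoredOffsets().map { case (partition, offset) =>
      new TopicPartition(topic, partition) -> new OffsetAndMetadata(offset)
    }
    consumer.assign(offsets.keys.toSeq.asJava)
    // Overwrite any offsets that were reset to 0 with the stored values;
    // the streaming app then starts from the offsets already in Kafka.
    consumer.commitSync(offsets.asJava)
  } finally {
    consumer.close()
  }
}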

Upvotes: 0

vaquar khan

Reputation: 11489

Kafka divides partition consumption over the consumer instances within a consumer group. Each consumer in the consumer group is an exclusive consumer of a “fair share” of partitions. This is how Kafka does load balancing of consumers in a consumer group. Consumer membership within a consumer group is handled dynamically by the Kafka protocol. If a new consumer joins a consumer group, it gets a share of the partitions. If a consumer dies, its partitions are split among the remaining live consumers in the consumer group. This is how Kafka does failover of consumers in a consumer group.
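For illustration, a minimal sketch of that behaviour with a plain Kafka consumer (broker address, group id, and topic name are placeholders); the first poll triggers the group join and the partition assignment:

import java.util.{Collections, Properties}
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker-host:9092")  // placeholder
props.put("group.id", "example-group")              // all members of this group share the topic's partitions
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my-topic"))
consumer.poll(5000)  // joins the group and receives a partition assignment
println(s"Assigned partitions: ${consumer.assignment().asScala.mkString(", ")}")
consumer.close()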

UnderReplicatedPartitions:

In a healthy cluster, the number of in-sync replicas (ISRs) should be exactly equal to the total number of replicas. If partition replicas fall too far behind their leaders, the follower partition is removed from the ISR pool, and you should see a corresponding increase in IsrShrinksPerSec. Since Kafka’s high-availability guarantees cannot be met without replication, investigation is certainly warranted should this metric value exceed zero for extended time periods.

IsrShrinksPerSec/IsrExpandsPerSec:

The number of in-sync replicas (ISRs) for a particular partition should remain fairly static; the only exceptions are when you are expanding your broker cluster or removing partitions. In order to maintain high availability, a healthy Kafka cluster requires a minimum number of ISRs for failover. A replica can be removed from the ISR pool for a couple of reasons: it is too far behind the leader’s offset (user-configurable by setting the replica.lag.max.messages configuration parameter), or it has not contacted the leader for some time (configurable with the replica.socket.timeout.ms parameter). No matter the reason, an increase in IsrShrinksPerSec without a corresponding increase in IsrExpandsPerSec shortly thereafter is cause for concern and requires user intervention. The Kafka documentation provides a wealth of information on the user-configurable parameters for brokers.

https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/
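If you want to spot-check these broker metrics directly rather than through a monitoring tool, here is a minimal sketch that reads them over JMX; it assumes the broker was started with remote JMX enabled (e.g. JMX_PORT set), and the host and port are placeholders.

import javax.management.ObjectName
import javax.management.remote.{JMXConnectorFactory, JMXServiceURL}

// Connect to a broker's JMX endpoint (host/port are placeholders).
val url = new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi")
val connector = JMXConnectorFactory.connect(url)
try {
  val mbsc = connector.getMBeanServerConnection
  // UnderReplicatedPartitions is a gauge, so read its "Value" attribute.
  val underReplicated = mbsc.getAttribute(
    new ObjectName("kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions"), "Value")
  // IsrShrinksPerSec is a meter; "OneMinuteRate" gives the recent rate.
  val isrShrinks = mbsc.getAttribute(
    new ObjectName("kafka.server:type=ReplicaManager,name=IsrShrinksPerSec"), "OneMinuteRate")
  println(s"UnderReplicatedPartitions = $underReplicated, IsrShrinksPerSec (1m) = $isrShrinks")
} finally {
  connector.close()
}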

Under fair sharing, Spark assigns tasks between jobs in a “round robin” fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.

By default, Spark’s scheduler runs jobs in FIFO fashion: the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, and so on.

In Spark Streaming you can configure the FAIR scheduling mode, and Spark Streaming's JobScheduler should then submit the Spark jobs per topic in parallel.

To enable the fair scheduler, simply set the spark.scheduler.mode property to FAIR when configuring a SparkContext:

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.scheduler.mode", "FAIR")
val sc = new SparkContext(conf)
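With FAIR mode enabled you can also route the jobs submitted from a given thread into a named scheduler pool; the pool name below is just an example, not a required value.

// Jobs submitted from this thread after the call go into the named pool.
sc.setLocalProperty("spark.scheduler.pool", "streaming-pool")  // example pool name
// ... run streaming jobs ...
sc.setLocalProperty("spark.scheduler.pool", null)              // revert to the default pool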

Upvotes: 2
