Reputation: 71
I'm experimenting with Spark's Continuous Processing mode in Structured Streaming. I'm reading from a Kafka topic with 2 partitions, and the Spark application has a single executor with a single core.
The application is simple: it just reads from the first topic and publishes to a second one. The problem is that my console consumer reading from the second topic only sees messages from one partition of the first topic, which means my Spark application is reading from only one partition of the topic.
How can I make my Spark application read from both partitions of the topic?
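For context, here is a minimal sketch of the kind of pipeline I'm describing (the broker address, topic names, and checkpoint path are placeholders, not my actual config):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("continuous-kafka-relay") // placeholder name
  .getOrCreate()

// Read from the first topic (2 partitions in my case).
val input = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "topic-in")
  .load()

// Relay the key/value pairs as-is to the second topic,
// using the continuous trigger instead of micro-batches.
val query = input
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "topic-out")
  .option("checkpointLocation", "/tmp/checkpoint")
  .trigger(Trigger.Continuous("1 second"))
  .start()

query.awaitTermination()
```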
Note
I'm asking this question for people who might run into the same issue I did.
Upvotes: 2
Views: 1084
Reputation: 71
I found the answer to my question in the caveats section of the Spark Structured Streaming documentation.
Basically, in continuous processing mode Spark launches long-running tasks, each of which reads from one partition of the topic. Since only one task can run per core, the Spark application needs as many cores as there are Kafka topic partitions it reads from.
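In local mode, that means giving Spark at least as many cores as the topic has partitions; a minimal sketch (the app name is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// local[2] gives the application two cores, one long-running
// task per Kafka partition.
val spark = SparkSession.builder()
  .appName("continuous-kafka-relay") // placeholder name
  .master("local[2]")
  .getOrCreate()
```

On a cluster, the equivalent is making sure the total core count (e.g. `--num-executors` times `--executor-cores` in `spark-submit`) is at least the number of Kafka partitions the application reads from.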
Upvotes: 5