MaatDeamon

Reputation: 9771

Spark-Streaming Kafka Direct Streaming API & Parallelism

I understand the automatic mapping that exists between a Kafka partition, a Spark RDD partition, and ultimately a Spark task. However, in order to properly size my executors (in number of cores), and therefore my nodes and cluster, I need to understand something that seems to be glossed over in the documentation.

In Spark Streaming, how exactly do data consumption, data processing, and task allocation work? In other words:

  1. Does the Spark task corresponding to a Kafka partition both read and process the data?

Can someone clearly outline what exactly is going on here?

EDIT1

We don't even need that memory-limit control. The mere ability to fetch while processing is going on, and to stop there, would be enough. In other words, the two processes should be asynchronous, with the limit simply being to stay one step ahead. If somehow this is not happening, I find it extremely strange that Spark would implement something that breaks performance like that.
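The "one step ahead" idea can be sketched as a two-stage pipeline. This is a hypothetical illustration of the asker's proposal, not Spark code: a fetcher thread pre-fetches the next batch while the main thread processes the current one, and a bounded queue enforces the limit.

```python
import queue
import threading

def fetch_batches():
    """Stand-in for reading micro-batches from Kafka (fake records)."""
    for i in range(3):
        yield [i * 10 + j for j in range(5)]

def run_pipelined():
    # maxsize=1 is the "limit": the fetcher can be at most one batch
    # ahead of the processor before it blocks.
    buffer = queue.Queue(maxsize=1)
    sentinel = object()

    def fetcher():
        for batch in fetch_batches():
            buffer.put(batch)   # blocks once we are one batch ahead
        buffer.put(sentinel)

    threading.Thread(target=fetcher, daemon=True).start()

    results = []
    while True:
        batch = buffer.get()
        if batch is sentinel:
            break
        results.append(sum(batch))  # stand-in for processing the batch
    return results

print(run_pipelined())  # → [10, 60, 110]
```

Fetching and processing overlap, yet memory stays bounded because the queue never holds more than one pre-fetched batch.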

Upvotes: 6

Views: 1568

Answers (1)

Yuval Itzchakov

Reputation: 149608

Does the Spark task corresponding to a Kafka partition both read and process the data?

The relationship is very close to what you describe, if by a task we mean the part of the graph that reads from Kafka up until a shuffle operation. The flow of execution is as follows:

  1. The driver reads the offsets of all Kafka topics and partitions.
  2. The driver assigns each executor a topic and partition to read and process.
  3. Unless there is a shuffle boundary operation, Spark will likely run the entire execution of the partition on the same executor.

This means that a single executor will read a given TopicPartition and process the entire execution graph on it, unless we need to shuffle. Since a Kafka partition maps to a partition inside the RDD, we get that guarantee.
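The flow above can be sketched as a simplified model (the `OffsetRange`, `plan_batch`, and `run_task` names are illustrative, not Spark internals): the driver computes one offset range per TopicPartition, and the task for that partition both fetches its slice and processes it on a single executor.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OffsetRange:
    """One task's slice of one TopicPartition for one micro-batch."""
    topic: str
    partition: int
    from_offset: int
    until_offset: int

def plan_batch(latest_offsets, committed_offsets):
    """Driver side: one OffsetRange (i.e. one task) per TopicPartition."""
    return [
        OffsetRange(t, p, committed_offsets[(t, p)], until)
        for (t, p), until in sorted(latest_offsets.items())
    ]

def run_task(offset_range, fetch, process):
    """Executor side: the same task fetches and then processes its range."""
    records = fetch(offset_range)
    return process(records)

# Toy usage: two partitions -> two tasks, each reading only its own slice.
latest = {("events", 0): 6, ("events", 1): 4}
committed = {("events", 0): 2, ("events", 1): 0}
tasks = plan_batch(latest, committed)

fetch = lambda r: list(range(r.from_offset, r.until_offset))  # fake broker read
process = len  # stand-in for the user's transformation

print([run_task(t, fetch, process) for t in tasks])  # → [4, 4]
```

Because reading and processing happen inside the same task, the Kafka-partition-to-RDD-partition mapping holds for the whole pre-shuffle stage.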

Structured Streaming takes this even further. In Structured Streaming, there is stickiness between a TopicPartition and a worker/executor: if a given worker was assigned a TopicPartition, it is likely to continue processing it for the entire lifetime of the application.
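That stickiness can be sketched as follows. This is a hypothetical model, not Structured Streaming's actual scheduler: once a TopicPartition is assigned to an executor, later batches reuse the same assignment, which is what lets cached Kafka consumers and data locality pay off.

```python
def make_sticky_assigner(executors):
    """Return an assign() function that remembers past assignments."""
    assignments = {}                 # TopicPartition -> executor, kept forever
    free_executors = iter(executors)

    def assign(topic_partition):
        # A partition seen before keeps its executor; a new one gets
        # the next free executor.
        if topic_partition not in assignments:
            assignments[topic_partition] = next(free_executors)
        return assignments[topic_partition]

    return assign

assign = make_sticky_assigner(["exec-1", "exec-2", "exec-3"])

# Batch 1 establishes the assignment ...
first = [assign(("events", p)) for p in range(2)]
# ... and batch 2 sees the exact same mapping.
second = [assign(("events", p)) for p in range(2)]
print(first == second, first)  # → True ['exec-1', 'exec-2']
```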

Upvotes: 2
