Soumitra

Reputation: 612

How does partitioning of DStreams (for updateStateByKey()) work, and how to verify it?

I am using the updateStateByKey() operation to maintain state in my Spark Streaming application. The input data arrives from a Kafka topic.

  1. How are DStreams partitioned?
  2. How does partitioning work with the mapWithState() or updateStateByKey() method?
  3. In updateStateByKey(), are the old state and the new values for a given key processed on the same node?
  4. How frequently does the updateStateByKey() method shuffle data?

The state I have to maintain contains ~100,000 keys, and I want to avoid shuffling the entire state every time I update it. Any tips on how to do this?

Upvotes: 4

Views: 1266

Answers (2)

Soumitra

Reputation: 612

Link to the answer to the same question by Tathagata Das:

https://www.mail-archive.com/[email protected]/msg43512.html

Following is the text:

Both mapWithState() and updateStateByKey() by default use the HashPartitioner, hashing the key of the key-value DStream on which the state operation is applied. The new data and the state are partitioned with the exact same partitioner, so that keys from the new data (from the input DStream) get shuffled and colocated with the already-partitioned state RDDs. The new data is thus brought to the corresponding old state on the same machine, and then the state mapping/updating function is applied.

The state is not shuffled every time; only the new batch of data is shuffled in each batch interval.
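The colocation described above can be sketched in plain Python. This is an illustrative simulation, not Spark code: the function names (`partition_by_hash`, `update_state`) are made up for the example, and Python's `hash()` stands in for Spark's `HashPartitioner` (which computes `nonNegativeMod(key.hashCode, numPartitions)`).

```python
# Simulate how a hash partitioner colocates state and new data.
# Illustrative sketch only; names and logic are assumptions, not Spark APIs.

def partition_by_hash(pairs, num_partitions):
    """Group (key, value) pairs into partitions by hash of the key."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        idx = hash(key) % num_partitions
        partitions[idx][key] = value
    return partitions

def update_state(state_partitions, new_batch, num_partitions, update_fn):
    """Apply an updateStateByKey-style function partition by partition.

    Only the new batch is hashed (i.e., "shuffled"); the state partitions
    stay where they are, because both sides use the same partitioner, so
    each new record lands in the partition that already holds its state.
    """
    new_partitions = partition_by_hash(new_batch, num_partitions)
    for idx in range(num_partitions):
        for key, value in new_partitions[idx].items():
            old = state_partitions[idx].get(key)
            state_partitions[idx][key] = update_fn(value, old)
    return state_partitions

# A running count per key, like a typical updateStateByKey update function.
state = partition_by_hash([("a", 1), ("b", 2)], num_partitions=4)
state = update_state(state, [("a", 5), ("c", 7)], 4,
                     lambda new, old: new + (old or 0))
```

Note that in each batch only the incoming records are re-hashed; the state dictionaries themselves never move between partitions, which mirrors why the state RDD is not shuffled on every update.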

Upvotes: 2

Steve

Reputation: 26

updateStateByKey() takes a Partitioner as its second argument: http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
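To see why supplying your own partitioner matters, here is a hedged pure-Python sketch (the linked API is Scala; the names `range_partitioner` and `partition_with` are invented for the example). If the state is created with the same partitioner that every batch uses, the state never needs to move; only each (small) incoming batch is repartitioned.

```python
# Illustrative sketch of passing an explicit partitioner, analogous to
# updateStateByKey(updateFunc, partitioner) in the Scala API.
# All names here are assumptions for the example, not Spark APIs.

def range_partitioner(key, num_partitions):
    """Custom partitioner: keys starting 'a'-'m' go to partition 0, rest to 1."""
    return 0 if key[0].lower() <= "m" else min(1, num_partitions - 1)

def partition_with(partitioner, pairs, num_partitions):
    """Route (key, value) pairs into partitions using the given partitioner."""
    parts = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        parts[partitioner(key, num_partitions)][key] = value
    return parts

NUM_PARTITIONS = 2

# Partition the initial state once with the custom partitioner...
state = partition_with(range_partitioner,
                       [("apple", 3), ("zebra", 1)], NUM_PARTITIONS)
# ...then route each new batch through the SAME partitioner, so every
# incoming record lands in the partition that already holds its state.
batch = partition_with(range_partitioner, [("apple", 2)], NUM_PARTITIONS)

for idx in range(NUM_PARTITIONS):
    for key, value in batch[idx].items():
        state[idx][key] = state[idx].get(key, 0) + value
```

The design point: as long as the state side and the batch side agree on one partitioner, the merge is a local, per-partition operation, and only the new records incur network movement.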

Upvotes: 0
