Sathish

Reputation: 5173

Does updateStateByKey in Spark shuffle the data across nodes?

I'm a newbie in Spark, and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.

My application basically counts the number of words every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update, as follows:

val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
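
The UpdateFunc referenced above is not shown in the question; the following is a minimal sketch of what such a running-count update function might look like (the name and the Int state type follow the snippet above, but the body itself is an assumption):

// Sketch of a running-count update function (assumed, not taken from the question).
// newValues holds this key's counts from the current batch;
// runningCount is the state carried over from previous batches.
def UpdateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
  Some(newValues.sum + runningCount.getOrElse(0))
}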

Say that after the second map operation, the same keys (words) might be present across worker nodes because they fall into different partitions. So I assume that the updateStateByKey method internally shuffles and aggregates the values for each key as a Seq[Int] and then calls the updateFunc. Is my assumption correct?

Upvotes: 2

Views: 450

Answers (2)

Soumitra

Reputation: 612

updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes that already hold the state for the same key.

Link to Tathagata's answer to a similar question: https://www.mail-archive.com/[email protected]/msg43512.html
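
Conceptually, something like the following happens on each batch (a simplified sketch for illustration, not Spark's actual StateDStream code): the previous state RDD keeps its partitioner, and only the new batch data is shuffled so it lands on the partition already holding that key's state.

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Simplified sketch of one batch of state update (assumed, for illustration only):
// prevState keeps its existing partitioning; only newBatch is shuffled so that
// each key's new records land on the partition already holding that key's state.
def updateSketch(newBatch: RDD[(String, Int)],
                 prevState: RDD[(String, Int)]): RDD[(String, Int)] = {
  val partitioner = prevState.partitioner
    .getOrElse(new HashPartitioner(prevState.partitions.length))
  // cogroup shuffles newBatch to match prevState's partitioning;
  // prevState itself is not re-shuffled because it is already partitioned.
  prevState.cogroup(newBatch, partitioner).mapValues {
    case (oldCounts, newCounts) => oldCounts.sum + newCounts.sum
  }
}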

Upvotes: 0

alexandrosB

Reputation: 26

Correct: as you can see in the method signature, it takes an optional numPartitions/Partitioner argument, which determines the number of reducers, i.e. state updaters. This leads to a shuffle.

Also, I suggest explicitly passing a number there; otherwise Spark may significantly decrease your job's parallelism by trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files.
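
For example, building on the wordDstream and UpdateFunc from the question (the count of 8 partitions is just a placeholder):

import org.apache.spark.HashPartitioner

// Fix the parallelism of the state updates explicitly (8 is an arbitrary example):
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _, 8)

// Or pass a Partitioner, which also pins how the state RDD is partitioned:
val statePartitioned = wordDstream.updateStateByKey(UpdateFunc _, new HashPartitioner(8))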

Upvotes: 1
