Reputation: 5173
I'm new to Spark and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.
My application basically counts the number of words in every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update, as follows:
val words = inputDstream.flatMap(x => x.split(" "))            // split each line into words
val wordDstream = words.map(x => (x, 1))                       // pair each word with a count of 1
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)  // stateful update per key
stateDstream.print()
Say that after the second map operation, the same keys (words) might be present across worker nodes due to the different partitions. So I assume that the updateStateByKey method internally shuffles and aggregates the values for each key into a Seq[Int] and then calls the updateFunc. Is my assumption correct?
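For reference, the UpdateFunc is not shown above; a minimal sketch, assuming the state is simply a running Int count, could look like this:

// Sketch of the assumed update function: Spark Streaming passes it all new
// values seen for a key in the current batch (Seq[Int]) together with the
// previous state for that key (Option[Int]), and it returns the new state.
def UpdateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))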
Upvotes: 2
Views: 450
Reputation: 612
updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes that already hold the state for the same key.
Link to Tathagata Das's answer to a similar question: https://www.mail-archive.com/[email protected]/msg43512.html
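Conceptually (a sketch only, with newBatch, prevState and updateFunc as hypothetical stand-ins, not the actual Spark internals), each batch behaves as if the new data were cogrouped with the previous state RDD under the same partitioner, so the larger state RDD stays in place and only the incoming batch moves:

// Both RDDs share the same partitioner, so cogroup co-locates new values
// with the existing state without re-shuffling the state itself.
val updated = newBatch.cogroup(prevState).mapValues {
  case (newValues, oldState) => updateFunc(newValues.toSeq, oldState.headOption)
}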
Upvotes: 0
Reputation: 26
Correct: as you can see from the method signature, updateStateByKey takes an optional numPartitions/Partitioner argument, which determines the number of reducers, i.e. state updaters. This leads to a shuffle.
Also, I suggest passing an explicit value there; otherwise Spark may significantly decrease your job's parallelism while trying to run tasks local to the blocks of the HDFS checkpoint files.
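For example (a sketch; HashPartitioner and defaultParallelism are just reasonable choices here, and ssc is assumed to be the StreamingContext from the question's setup):

import org.apache.spark.HashPartitioner
// Pin the number of state partitions explicitly so the parallelism is not
// dictated by the locality of the HDFS checkpoint file blocks.
val stateDstream = wordDstream.updateStateByKey(
  UpdateFunc _,
  new HashPartitioner(ssc.sparkContext.defaultParallelism))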
Upvotes: 1