Reputation: 5173
I'm new to Spark and I would like to understand whether I need to aggregate the DStream data by key before calling updateStateByKey.
My application basically counts the number of words in every second using Spark Streaming, where I perform a couple of map operations before doing a stateful update, as follows:
val words = inputDstream.flatMap(x => x.split(" "))            // split each line into words
val wordDstream = words.map(x => (x, 1))                       // pair each word with a count of 1
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)  // stateful update per key
stateDstream.print()
Say that after the second map operation, the same keys (words) might be present across worker nodes due to the different partitions. So I assume that the updateStateByKey method internally shuffles and aggregates the values for each key into a Seq[Int] and then calls the updateFunc. Is my assumption correct?
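For reference, the UpdateFunc is not shown above; a minimal sketch, assuming the state is simply a running Int count, could look like this:

// Sketch of the assumed update function: Spark Streaming passes it all new
// values seen for a key in the current batch (Seq[Int]) together with the
// previous state for that key (Option[Int]), and it returns the new state.
def UpdateFunc(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] =
  Some(newValues.sum + runningCount.getOrElse(0))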
Upvotes: 2
Views: 450
Reputation: 612
updateStateByKey() does not shuffle the state; rather, the new data is brought to the nodes that already hold the state for the same key.
Link to Tathagata Das's answer to a similar question: https://www.mail-archive.com/[email protected]/msg43512.html
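Conceptually (a sketch only, with newBatch, prevState and updateFunc as hypothetical stand-ins, not the actual Spark internals), each batch behaves as if the new data were cogrouped with the previous state RDD under the same partitioner, so the larger state RDD stays in place and only the incoming batch moves:

// Both RDDs share the same partitioner, so cogroup co-locates new values
// with the existing state without re-shuffling the state itself.
val updated = newBatch.cogroup(prevState).mapValues {
  case (newValues, oldState) => updateFunc(newValues.toSeq, oldState.headOption)
}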
Upvotes: 0
Reputation: 26
Correct: as you can see from the method signature, updateStateByKey takes an optional numPartitions/Partitioner argument, which determines the number of reducers, i.e. state updaters. This leads to a shuffle.
Also, I suggest passing an explicit value there; otherwise Spark may significantly decrease your job's parallelism while trying to run tasks local to the blocks of the HDFS checkpoint files.
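For example (a sketch; HashPartitioner and defaultParallelism are just reasonable choices here, and ssc is assumed to be the StreamingContext from the question's setup):

import org.apache.spark.HashPartitioner
// Pin the number of state partitions explicitly so the parallelism is not
// dictated by the locality of the HDFS checkpoint file blocks.
val stateDstream = wordDstream.updateStateByKey(
  UpdateFunc _,
  new HashPartitioner(ssc.sparkContext.defaultParallelism))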
Upvotes: 1