Reputation: 612
I am using the updateStateByKey() operation to maintain state in my Spark Streaming application. The input data comes from a Kafka topic.
The state I have to maintain contains ~100,000 keys, and I want to avoid a shuffle every time I update the state. Any tips on how to do that?
Upvotes: 4
Views: 1266
Reputation: 612
Link to the answer to the same question by Tathagata Das:
https://www.mail-archive.com/[email protected]/msg43512.html
Following is the text:
Both mapWithState() and updateStateByKey() use the HashPartitioner by default, and hash the key in the key-value DStream on which the state operation is applied. The new data and the state are partitioned with the exact same partitioner, so that matching keys from the new data (from the input DStream) get shuffled and colocated with the already-partitioned state RDDs. The new data is thus brought to the corresponding old state on the same machine, and then the state mapping/updating function is applied.
The state is not shuffled every time; only the new data is shuffled in every batch.
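To illustrate, here is a minimal sketch of a stateful streaming job (not from the original answer; the socket source stands in for the Kafka input, and the checkpoint path, batch interval, and update function are assumptions for the example). The state RDD keeps its default HashPartitioner partitioning across batches; only each batch's new records are shuffled to it:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object StatefulCount {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("StatefulCount")
        val ssc = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint("/tmp/checkpoint") // updateStateByKey requires checkpointing

        // Stand-in source; in the question this would be the Kafka DStream.
        val lines = ssc.socketTextStream("localhost", 9999)
        val events = lines.flatMap(_.split(" ")).map(word => (word, 1))

        // Running count per key. Only this batch's new records are shuffled;
        // the already-partitioned state RDD stays where it is.
        val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
          Some(newValues.sum + state.getOrElse(0))

        val counts = events.updateStateByKey(updateFunc)
        counts.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }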
Upvotes: 2
Reputation: 26
updateStateByKey takes a Partitioner as its second argument: http://spark.apache.org/docs/1.6.0/api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions
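For example, a sketch of that overload (the partition count of 50 is arbitrary, and `events` is assumed to be the keyed DStream built from the Kafka input):

    import org.apache.spark.HashPartitioner

    // An explicit partitioner keeps the state's partitioning stable across
    // batches; pick a partition count suited to your cluster (50 is arbitrary).
    val partitioner = new HashPartitioner(50)

    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(newValues.sum + state.getOrElse(0))

    val counts = events.updateStateByKey(updateFunc, partitioner)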
Upvotes: 0