Reputation: 684
I want to transform the first DStream into the second using Spark, but I don't know how. I already tried groupByKey(), which didn't work, and aggregateByKey(), which only works on RDDs, not DStreams.
This is the current result:
DStream[(1,value1),(2,value2),(3,value3),(1,value4),(1,value5),(2,value6)]
This is the result I want:
DStream[(1,(value1,value4,value5)), (2,(value2,value6)), (3,(value3))]
Thanks for your replies.
Upvotes: 1
Views: 947
Reputation: 330093
groupByKey does exactly this: it converts a DStream[(K, V)] into a DStream[(K, Seq[V])]. I suspect that your expectations about the output may be wrong. Since a DStream is just an infinite sequence of RDDs, the grouping is applied to each RDD individually. So if the first batch contains:
(1,value1),(2,value2),(3,value3),(1,value4)
and the second
(1,value5),(2,value6)
you'll get
(1, [value1, value4]), (2, [value2]), (3, [value3])
and
(1,[value5]),(2,[value6])
respectively.
While DStreams support stateful operations (updateStateByKey), it is rather unlikely that you want one of those here, since the per-key collections would grow without bound across batches.
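The per-batch behaviour can be sketched as follows. This is a minimal example, not your actual pipeline: the socket source on localhost:9999 and the `k,v` line format are assumptions for illustration; only groupByKey() itself comes from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GroupByKeyExample")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: lines of the form "1,value1" arriving on a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(k, v) = line.split(",", 2)
        (k.toInt, v)
      }

    // groupByKey is applied to each batch RDD independently: keys are only
    // grouped within a single batch interval, never across batches.
    val grouped = pairs.groupByKey()
    grouped.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If the first one-second batch delivers (1,value1),(2,value2),(3,value3),(1,value4), the printed output for that batch is (1,[value1, value4]), (2,[value2]), (3,[value3]); values arriving in later batches are grouped separately.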
Upvotes: 3