Reputation: 684
I want to transform the first DStream into the second using Spark, but I don't know how. I already tried groupByKey(), which didn't work, and aggregateByKey(), which only works on RDDs, not DStreams.
This is the current result:
DStream[(1,value1),(2,value2),(3,value3),(1,value4),(1,value5),(2,value6)]
This is the result I want:
DStream[(1,(value1,value4,value5)), (2,(value2,value6)), (3,(value3))]
Thanks for your replies.
Upvotes: 1
Views: 947
Reputation: 330093
groupByKey does exactly this: it converts a DStream[(K, V)] into a DStream[(K, Seq[V])]. I suspect that your expectations about the output may be wrong. Since a DStream is just an infinite sequence of RDDs, the grouping is applied to each RDD individually. So if the first batch contains:
(1,value1),(2,value2),(3,value3),(1,value4)
and the second
(1,value5),(2,value6)
you'll get
(1, [value1, value4]), (2, [value2]), (3, [value3])
and
(1,[value5]),(2,[value6])
respectively.
While DStreams support stateful operations (updateStateByKey), it is rather unlikely that you want one of those here, since the per-key collections would grow without bound across batches.
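The per-batch behaviour can be sketched as follows. This is a minimal example, not your actual pipeline: the socket source on localhost:9999 and the `k,v` line format are assumptions for illustration; only groupByKey() itself comes from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GroupByKeyExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("GroupByKeyExample")
    val ssc  = new StreamingContext(conf, Seconds(1))

    // Hypothetical source: lines of the form "1,value1" arriving on a socket.
    val pairs = ssc.socketTextStream("localhost", 9999)
      .map { line =>
        val Array(k, v) = line.split(",", 2)
        (k.toInt, v)
      }

    // groupByKey is applied to each batch RDD independently: keys are only
    // grouped within a single batch interval, never across batches.
    val grouped = pairs.groupByKey()
    grouped.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

If the first one-second batch delivers (1,value1),(2,value2),(3,value3),(1,value4), the printed output for that batch is (1,[value1, value4]), (2,[value2]), (3,[value3]); values arriving in later batches are grouped separately.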
Upvotes: 3