Reputation: 159
I have a problem in Spark with Scala: I want to compute the average of the values in a DStream. I get data from Kafka into a DStream like this,
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to average them per key like this,
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
then get the result like this,
[(2,120),(3,204),(4,160)]
How can I do this in Scala on a DStream? I am using Spark 1.6.
Upvotes: 0
Views: 116
Reputation: 1181
Use map to convert the input records (x, y):
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
into (x, (y, 1)):
[(2,(110, 1)),(2,(130, 1)),(2,(120, 1)),(3,(200, 1)),(3,(206, 1)),(3,(206, 1)),(4,(150, 1)),(4,(160, 1)),(4,(170, 1))]
Now, use reduceByKeyAndWindow with a reduce function that combines two records per key, turning (x, (y1, c1)) and (x, (y2, c2)) into (x, (y1+y2, c1+c2)):
[(2,(360, 3)),(3,(612, 3)),(4,(480, 3))]
Run map once more to get the average, converting (x, (sum, count)) to (x, sum/count):
[(2,120),(3,204),(4,160)]
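Putting the three steps together, here is a minimal sketch. Assumptions: the queueStream below is just a stand-in for your Kafka DStream (keep your own KafkaUtils source in the real job), and the window/slide durations are placeholders to replace with your own.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KeyedAverage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KeyedAverage")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in for the Kafka DStream, pre-loaded with the sample data.
    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq(
      (2, 110), (2, 130), (2, 120),
      (3, 200), (3, 206), (3, 206),
      (4, 150), (4, 160), (4, 170))))
    val stream = ssc.queueStream(rddQueue)

    val averages = stream
      // step 1: (x, y) -> (x, (y, 1)), so every value carries a count of 1
      .map { case (k, v) => (k, (v, 1)) }
      // step 2: sum the values and the counts per key over the window
      .reduceByKeyAndWindow(
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2),
        Seconds(10),  // window duration (placeholder)
        Seconds(5))   // slide duration (placeholder)
      // step 3: (x, (sum, count)) -> (x, sum / count)
      .map { case (k, (sum, count)) => (k, sum / count) }

    averages.print()  // (2,120), (3,204), (4,160)

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that sum / count here is integer division, which happens to be exact for your sample data; map the values to Double first if you need fractional averages.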
Upvotes: 3