Reputation: 159
I have a problem in Spark with Scala: I want to compute the average of the values in a DStream. I get data from Kafka into a DStream like this,
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
I want to average them per key like this,
[(2,(110+130+120)/3),(3,(200+206+206)/3),(4,(150+160+170)/3)]
then get the result like this,
[(2,120),(3,204),(4,160)]
How can I do this in Scala on a DStream? I am using Spark 1.6.
Upvotes: 0
Views: 116
Reputation: 1181
Use map to convert the input records (x, y):
[(2,110),(2,130),(2,120),(3,200),(3,206),(3,206),(4,150),(4,160),(4,170)]
into (x, (y, 1)):
[(2,(110, 1)),(2,(130, 1)),(2,(120, 1)),(3,(200, 1)),(3,(206, 1)),(3,(206, 1)),(4,(150, 1)),(4,(160, 1)),(4,(170, 1))]
Now, use reduceByKeyAndWindow with a reduce function that combines two records per key, turning (x, (y1, c1)) and (x, (y2, c2)) into (x, (y1+y2, c1+c2)):
[(2,(360, 3)),(3,(612, 3)),(4,(480, 3))]
Run map once more to get the average, converting (x, (sum, count)) to (x, sum/count):
[(2,120),(3,204),(4,160)]
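Putting the three steps together, here is a minimal sketch. Assumptions: the queueStream below is just a stand-in for your Kafka DStream (keep your own KafkaUtils source in the real job), and the window/slide durations are placeholders to replace with your own.

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object KeyedAverage {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("KeyedAverage")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in for the Kafka DStream, pre-loaded with the sample data.
    val rddQueue = mutable.Queue(ssc.sparkContext.parallelize(Seq(
      (2, 110), (2, 130), (2, 120),
      (3, 200), (3, 206), (3, 206),
      (4, 150), (4, 160), (4, 170))))
    val stream = ssc.queueStream(rddQueue)

    val averages = stream
      // step 1: (x, y) -> (x, (y, 1)), so every value carries a count of 1
      .map { case (k, v) => (k, (v, 1)) }
      // step 2: sum the values and the counts per key over the window
      .reduceByKeyAndWindow(
        (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2),
        Seconds(10),  // window duration (placeholder)
        Seconds(5))   // slide duration (placeholder)
      // step 3: (x, (sum, count)) -> (x, sum / count)
      .map { case (k, (sum, count)) => (k, sum / count) }

    averages.print()  // (2,120), (3,204), (4,160)

    ssc.start()
    ssc.awaitTermination()
  }
}

Note that sum / count here is integer division, which happens to be exact for your sample data; map the values to Double first if you need fractional averages.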
Upvotes: 3