mrsquid

Reputation: 635

Pyspark RDD aggregate different value fields differently

This is a pretty open ended question, but I have an RDD in this format.

[('2014-06', ('131313', 5.5, 6.5, 7.5, 10.5 )),
('2014-07', ('246655', 636636.53, .53252, 5252.112, 5242.23)),
('2014-06', ('131232', 1, 2, 4.5, 5.5)),
('2014-07', ('131322464', 536363.6363, 536336.6363, 3563.63636, 9.6464464646464646))]
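For reference, here is a minimal sketch of how an RDD like this can be built, assuming a SparkContext is already available as sc:

# assuming an existing SparkContext `sc`
data = [('2014-06', ('131313', 5.5, 6.5, 7.5, 10.5)),
        ('2014-07', ('246655', 636636.53, .53252, 5252.112, 5242.23)),
        ('2014-06', ('131232', 1, 2, 4.5, 5.5)),
        ('2014-07', ('131322464', 536363.6363, 536336.6363, 3563.63636, 9.6464464646464646))]
rdd = sc.parallelize(data)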

I want to group by the key and aggregate each of the value fields differently. For example, for the key '2014-06' I want to get the count of the first value field (i.e. '131313') and the average of each of the other fields (5.5, 6.5, 7.5, 10.5).

So the result for the simple example above for key '2014-06' would be ('2014-06', (2, 3.25, 4.25, 6.0, 8.0)): a count of 2, followed by the element-wise averages of (5.5, 6.5, 7.5, 10.5) and (1, 2, 4.5, 5.5).

What would be the best method to do this with an RDD? I cannot use any Spark SQL expressions or functions, only RDD functions.

I was thinking about doing something with mapValues and using some other function but I am having some trouble formulating this function.

I know this question is pretty open ended, so please let me know if you have any more questions.

Thank you for your time.

Upvotes: 3

Views: 1279

Answers (2)

blackbishop

Reputation: 32650

@jxc's solution does what you need, but here is another way of doing it.

You can use aggregateByKey. This function takes a neutral "zero value" accumulator plus two functions: seqFunc, which folds one record into an accumulator within a partition, and combFunc, which merges the accumulators from different partitions.

from operator import add

zero_value = (0, 0, 0, 0, 0)
d = rdd.aggregateByKey(
        zero_value,
        # seqFunc: bump the count and add the numeric fields element-wise
        lambda acc, v: (acc[0] + 1, *map(add, acc[1:], v[1:])),
        # combFunc: merge two partial accumulators element-wise
        lambda a, b: tuple(map(add, a, b))
    ).mapValues(lambda v: (v[0], *[i / v[0] for i in v[1:]]))

The first lambda expression (the seqFunc) folds each record into the accumulator: it increments the count that stands in for the string field and adds the remaining numeric fields element-wise. The second lambda expression (the combFunc) merges two accumulators by element-wise addition.

After this aggregation, we just need to divide every element of each value tuple except the first by the first element (the count), which gives the averages. For '2014-06', for example, the aggregated tuple is (2, 6.5, 8.5, 12.0, 16.0), and dividing by the count of 2 yields (2, 3.25, 4.25, 6.0, 8.0).

Output:

[('2014-06', (2, 3.25, 4.25, 6.0, 8.0)), ('2014-07', (2, 586500.0831500001, 268168.58441, 4407.87418, 2625.938223232323))]

Upvotes: 1

jxc
jxc

Reputation: 13998

One way is to use the map() method to convert the first value to 1 (for record counting), then use reduceByKey() to sum the value tuples element-wise for each key. Finally, use mapValues() to calculate the mean of every field except the first, which is the count (kept as-is).

(rdd
 # replace the string field with 1 so records can be counted by summing
 .map(lambda x: (x[0], (1, *x[1][1:])))
 # element-wise sum per key, then divide the sums by the count for the mean
 .reduceByKey(lambda x, y: tuple(a + b for a, b in zip(x, y)))
 .mapValues(lambda x: (x[0], *[e / x[0] for e in x[1:]])))

After map():

[('2014-06', (1, 5.5, 6.5, 7.5, 10.5)),
 ('2014-07', (1, 636636.53, 0.53252, 5252.112, 5242.23)),
 ('2014-06', (1, 1, 2, 4.5, 5.5)),
 ('2014-07', (1, 536363.6363, 536336.6363, 3563.63636, 9.646446464646464))]

After reduceByKey():

[('2014-06', (2, 6.5, 8.5, 12.0, 16.0)),
 ('2014-07',
  (2, 1173000.1663000002, 536337.16882, 8815.74836, 5251.876446464646))]

After mapValues():

[('2014-06', (2, 3.25, 4.25, 6.0, 8.0)),
 ('2014-07',
  (2, 586500.0831500001, 268168.58441, 4407.87418, 2625.938223232323))]
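A quick usage sketch to materialize the result: collect() after an optional sortByKey(), since the order of collect() is not otherwise guaranteed:

result = (rdd
          .map(lambda x: (x[0], (1, *x[1][1:])))
          .reduceByKey(lambda x, y: tuple(a + b for a, b in zip(x, y)))
          .mapValues(lambda x: (x[0], *[e / x[0] for e in x[1:]]))
          .sortByKey())  # sort for a deterministic output order
print(result.collect())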

Upvotes: 1
