Get sum and length of rdd column using groupBy?

Question

I have the following RDD:

[(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)]

My expected RDD is:

[(1,[300, 1]), (2,[10, 3]), (4,[80,2])]

The first value in each list is the sum of the values for that key (e.g. for key 2: 2 + 3 + 5 = 10), and the second value is the number of occurrences of the key (e.g. key 1 occurs once). Can the expected RDD be produced using the groupBy function?

mck · Accepted Answer

You can map each value to a list [x, 1], then sum all the lists for each key.

rdd = sc.parallelize([(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)])

result = rdd.mapValues(lambda x: [x, 1]).reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]])

result.collect()
# [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]
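
To see why this works, here is a minimal plain-Python sketch of the same two steps, with no Spark involved. It only imitates the logic: the names `pairs`, `mapped`, `merge` and `by_key` are made up for illustration, and real `reduceByKey` merges values inside each partition before shuffling rather than collecting them first. Because Spark may apply the merge in any order, the merge function has to be associative and commutative, which element-wise addition is.

```python
from collections import defaultdict
from functools import reduce

pairs = [(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)]

# Step 1 (the mapValues step): turn each value x into [x, 1].
mapped = [(k, [v, 1]) for k, v in pairs]

# Step 2 (the reduceByKey step): merge two [sum, count] lists element-wise.
def merge(a, b):
    return [a[0] + b[0], a[1] + b[1]]

# Collect the [x, 1] lists per key, then fold each group with merge.
by_key = defaultdict(list)
for k, v in mapped:
    by_key[k].append(v)

result = sorted((k, reduce(merge, vs)) for k, vs in by_key.items())
print(result)  # [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]
```

For key 2 the fold goes [2, 1] + [3, 1] = [5, 2], then [5, 2] + [5, 1] = [10, 3], which matches the Spark result above.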
