Reputation: 197
When using PySpark to process data, I want to calculate two attributes per word. For example, the data looks like:
("word1", (1, 2))
("word1", (2, 3))
("word2", (3, 4))
("word2", (5, 6))
And I want to aggregate them like this:
("word1", (3, 5))
("word2", (8, 10))
that is, combine the tuple values by word. I have tried to use
rdd.reduceByKey(lambda: a, b:(a[0] + b[0], a[1], b[1]))
But it doesn't work. What should I do to handle such a data structure with pyspark.rdd? Thanks!
Upvotes: 0
Views: 29
Reputation: 6082
You're almost there. Two things are off in your attempt: the lambda syntax is invalid (lambda: a, b: should be lambda a, b:), and the second tuple element needs + rather than a comma:
rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
# [('word1', (3, 5)), ('word2', (8, 10))]
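For reference, here is a minimal, self-contained sketch that reproduces the result end to end (the use of SparkContext.getOrCreate() and the inlined sample data are my assumptions, not part of the question):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Sample data from the question
rdd = sc.parallelize([
    ("word1", (1, 2)),
    ("word1", (2, 3)),
    ("word2", (3, 4)),
    ("word2", (5, 6)),
])

# reduceByKey merges the tuple values element-wise for each key
sums = rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(sums.collect())
# [('word1', (3, 5)), ('word2', (8, 10))]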
Upvotes: 1