Reputation: 197
When using PySpark to process data, I want to calculate two attributes per word. For example, the data looks like:
("word1", (1, 2))
("word1", (2, 3))
("word2", (3, 4))
("word2", (5, 6))
And I want to aggregate them like this:
("word1", (3, 5))
("word2", (8, 10))
that is, combine the tuple values by word. I have tried to use
rdd.reduceByKey(lambda: a, b:(a[0] + b[0], a[1], b[1]))
But it doesn't work. What should I do to handle such a data structure with pyspark.rdd? Thanks!
Upvotes: 0
Views: 29
Reputation: 6082
You're almost there. Two things are off in your attempt: the lambda syntax is invalid (lambda: a, b: should be lambda a, b:), and the second tuple element needs + rather than a comma:
rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1])).collect()
# [('word1', (3, 5)), ('word2', (8, 10))]
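For reference, here is a minimal, self-contained sketch that reproduces the result end to end (the use of SparkContext.getOrCreate() and the inlined sample data are my assumptions, not part of the question):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Sample data from the question
rdd = sc.parallelize([
    ("word1", (1, 2)),
    ("word1", (2, 3)),
    ("word2", (3, 4)),
    ("word2", (5, 6)),
])

# reduceByKey merges the tuple values element-wise for each key
sums = rdd.reduceByKey(lambda a, b: (a[0] + b[0], a[1] + b[1]))
print(sums.collect())
# [('word1', (3, 5)), ('word2', (8, 10))]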
Upvotes: 1