P-S

Reputation: 4016

PySpark | Transform RDD from key with list of values > value with list of keys

In PySpark, how do you transform an input RDD where every key has a list of values into an output RDD where every value has a list of the keys it belongs to?

Input

[(1, ['a','b','c','e']), (2, ['b','d']), (3, ['a','d']), (4, ['b','c'])]

Output

[('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3]), ('e', [1])]

Upvotes: 0

Views: 66

Answers (1)

akuiper

Reputation: 214927

Flatten the RDD and swap key and value first, then groupByKey:

rdd.flatMap(lambda r: [(k, r[0]) for k in r[1]]).groupByKey().mapValues(list).collect()
# [('a', [1, 3]), ('e', [1]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3])]
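The flatMap/groupByKey pipeline can be sketched in plain Python to show what each step does, without needing a Spark cluster. `invert_index` is a hypothetical helper name chosen for this sketch, not part of any Spark API:

```python
from collections import defaultdict

def invert_index(pairs):
    """Invert (key, [values]) pairs into (value, [keys]) pairs.

    Mirrors the Spark pipeline:
      - the nested loop plays the role of flatMap, emitting one
        (value, key) tuple per list element;
      - the defaultdict plays the role of groupByKey, collecting
        all keys that share a value.
    """
    grouped = defaultdict(list)
    for key, values in pairs:
        for v in values:          # flatMap: one (v, key) pair per value
            grouped[v].append(key)  # groupByKey: accumulate keys per value
    return sorted(grouped.items())

data = [(1, ['a', 'b', 'c', 'e']), (2, ['b', 'd']), (3, ['a', 'd']), (4, ['b', 'c'])]
print(invert_index(data))
# [('a', [1, 3]), ('b', [1, 2, 4]), ('c', [1, 4]), ('d', [2, 3]), ('e', [1])]
```

One detail worth noting about the Spark version: `groupByKey()` yields a lazy `ResultIterable` per key, which is why the answer appends `.mapValues(list)` to materialize each group into a plain list before `collect()`. Also, `collect()` makes no ordering guarantee across partitions, so the keys may come back in any order (as in the answer's output comment); the sketch above sorts only for readability.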

Upvotes: 3
