user9496524

Reputation: 55

PySpark: How to Split String Value in Paired RDD and Map with Key

Given

data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])

My desired output is

[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]

I'm not sure how to map the following output

[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'),....etc.]

I tried using

data.flatMap(lambda x: [(x[0], v) for v in x[1]])

but this ended up pairing the key with each letter of the string instead of each word. Should flatMap, map, or split be used here?

After mapping, I plan to reduce the paired RDDs with matching keys and invert key and value by using

data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()

Is my thinking correct?

Upvotes: 2

Views: 1002

Answers (1)

ernest_k

Reputation: 45339

You can use flatMap to create tuples in which the key is reused and one entry is created per word (obtained with split()):

data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()])

When collected, that outputs

[(1, 'winter'),
 (1, 'is'),
 (1, 'coming'),
 (2, 'ours'),
 (2, 'is'),
 (2, 'the'),
 (2, 'fury'),
 (3, 'the'),
 (3, 'old'),
 (3, 'the'),
 (3, 'true'),
 (3, 'the'),
 (3, 'brave')]

Upvotes: 1
