Reputation: 55
Given
data = sc.parallelize([(1,'winter is coming'),(2,'ours is the fury'),(3,'the old the true the brave')])
My desired output is
[('fury',[2]),('true',[3]),('is',[1,2]),('old',[3]),('the',[2,3]),('ours',[2]),('brave',[3]),('winter',[1]),('coming',[1])]
I'm not sure how to map the data to the following intermediate output
[(1,'winter'),(1,'is'),(1,'coming'),(2,'ours'),(2,'is'),....etc.]
I tried using
data.flatMap(lambda x: [(x[0], v) for v in x[1]])
but this ended up mapping the key to each letter of the string instead of each word. Should flatMap, map, or split be used here?
After mapping, I plan to reduce the pair RDD by key, combining entries with the same key, and then swap key and value by using
data.reduceByKey(lambda a,b: a+b).map(lambda x:(x[1],x[0])).collect()
Is my thinking correct?
Upvotes: 2
Views: 1002
Reputation: 45339
You can flatMap and create tuples where keys are reused and an entry is created for each word (obtained using split()):
data.flatMap(lambda pair: [(pair[0], word) for word in pair[1].split()])
When collected, that outputs
[(1, 'winter'),
(1, 'is'),
(1, 'coming'),
(2, 'ours'),
(2, 'is'),
(2, 'the'),
(2, 'fury'),
(3, 'the'),
(3, 'old'),
(3, 'the'),
(3, 'true'),
(3, 'the'),
(3, 'brave')]
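To get from there to the word-to-ids output in the question, the word has to become the key before anything is combined; reducing on the id would only merge the words of each sentence back together. Below is a minimal sketch of that idea in plain Python, so it runs without a Spark session. The helper name explode and the dict-based grouping are my own illustration, not part of the Spark API, and the equivalent RDD chain in the closing comment is an untested suggestion.

```python
from collections import defaultdict

# Same records as in the question, as a plain list instead of an RDD.
data = [(1, 'winter is coming'),
        (2, 'ours is the fury'),
        (3, 'the old the true the brave')]

def explode(pair):
    """Turn (doc_id, text) into a list of (word, doc_id) pairs.

    Hypothetical helper: it emits the word first so that the word
    is the key in the grouping step.
    """
    doc_id, text = pair
    return [(word, doc_id) for word in text.split()]

# Plain-Python stand-in for "flatMap, drop duplicates, group by key".
index = defaultdict(set)
for pair in data:
    for word, doc_id in explode(pair):
        index[word].add(doc_id)  # a set removes repeats such as 'the' in record 3

# Sort the ids so each word maps to a stable list.
inverted = {word: sorted(ids) for word, ids in index.items()}

# Assumed RDD equivalent (not run here), with data as the original RDD:
# data.flatMap(explode).distinct().groupByKey().mapValues(sorted).collect()
```

With the sample data, inverted['is'] is [1, 2] and inverted['the'] is [2, 3]. The order of the words in the result is not guaranteed to match the desired output, and on an RDD the order of collected pairs can vary as well.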
Upvotes: 1