abbassix

Reputation: 645

running Map on a nested RDD

I have a huge RDD with hundreds of millions of records whose structure looks like this: rdd.collect() results in:

[['t1', 't5', 't7'],
['t1', 't7'], ['t2', 't3'],
['t1', 't3', 't4'], ...]

Suppose this is the entire RDD and not just a snapshot of it. What I want is this: desired.collect() results in:

('t1', <pyspark.resultiterable.ResultIterable object at 0x7f534107ead0>)
('t5', <pyspark.resultiterable.ResultIterable object at 0x7f534107e090>)
('t7', <pyspark.resultiterable.ResultIterable object at 0x7f534107e050>)
('t2', <pyspark.resultiterable.ResultIterable object at 0x7f534107efd0>)
('t3', <pyspark.resultiterable.ResultIterable object at 0x7f534107e510>)
('t4', <pyspark.resultiterable.ResultIterable object at 0x7f534107ed50>)

Where, for example:

for x in desired.collect()[0][1]:
  print(x)

results in:

t5
t7
t3
t4

because they share a list with t1.
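To make the intent concrete, here is a plain-Python (non-Spark) sketch of the grouping I am after, assuming each key should map to every other element that shares at least one list with it:

from collections import defaultdict

data = [['t1', 't5', 't7'],
        ['t1', 't7'], ['t2', 't3'],
        ['t1', 't3', 't4']]

# for each element, collect the other elements that appear in the same list
groups = defaultdict(set)
for sublist in data:
    for x in sublist:
        groups[x].update(y for y in sublist if y != x)

print(groups['t1'])
# {'t5', 't7', 't3', 't4'}

Of course this does not scale to hundreds of millions of records, which is why I need the equivalent on the RDD itself.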

Upvotes: 0

Views: 74

Answers (1)

Anjaneya Tripathi

Reputation: 1459

Continuing from your last question, we can use the same method along with groupByKey and mapValues:

import itertools

data2 = [['t1', 't5', 't7'], ['t1', 't7'], ['t2', 't3'], ['t1', 't3', 't4']]
rd2 = sc.parallelize(data2)

# emit every pair within a sublist, group by the first element,
# and collapse each group to a set to drop duplicates
rd2 = rd2.flatMap(lambda x: itertools.combinations(x, 2)).groupByKey().mapValues(set)

for x in rd2.collect()[0][1]:
  print(x)

#output
t5
t7
t3
t4

Edit - I have added .mapValues(set) to remove duplicates from the results.
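Note that combinations emits each pair only once, so only elements that appear first in some pair end up as keys (for example 't4' never becomes a key). If you need every element to appear as a key, one option is to emit ordered pairs in both directions with itertools.permutations instead; a minimal sketch, reusing sc and data2 from above:

rd3 = sc.parallelize(data2).flatMap(lambda x: itertools.permutations(x, 2)).groupByKey().mapValues(set)

for k, v in sorted(rd3.collect()):
    print(k, sorted(v))

# t1 ['t3', 't4', 't5', 't7']
# t2 ['t3']
# t3 ['t1', 't2', 't4']
# t4 ['t1', 't3']
# t5 ['t1', 't7']
# t7 ['t1', 't5']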

Upvotes: 1
