zoelearns

Reputation: 43

How to convert RDD list of lists into one list in pyspark

I have an RDD object, a list of lists, that looks like this (omitted millions of sublists, only left 3 here)

my_tuples = [[('a','b'),('a','c')], 
             [('b','a'),('b','f'),('b','g')], 
             [('zzsx','c'), ('zzsx','q'), ('zzsx','m'), ('zzsx','ay'), ('zzsx','bbt')]]

and I want to convert it into a single list like this

my_list = [('a','b'),('a','c'), ('b','a'),('b','f'),('b','g'), 
           ('zzsx','c'), ('zzsx','q'), ('zzsx','m'), ('zzsx','ay'), ('zzsx','bbt')]

I can't iterate over it with a loop, since my_tuples is an RDD and is far too large for that anyway. I'm new to Spark; any suggestion is appreciated. Thanks.

Upvotes: 4

Views: 1518

Answers (1)

ernest_k

Reputation: 45309

You can flatten it using flatMap:

rdd.flatMap(lambda l: l)

Since each element of the RDD is already a list, the function can simply return it unchanged. Collecting the result gives:

[('a', 'b'),
 ('a', 'c'),
 ('b', 'a'),
 ('b', 'f'),
 ('b', 'g'),
 ('zzsx', 'c'),
 ('zzsx', 'q'),
 ('zzsx', 'm'),
 ('zzsx', 'ay'),
 ('zzsx', 'bbt')]
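For intuition, flatMap applies the function to each element and concatenates the resulting iterables into one flat RDD. The same flattening can be sketched locally in plain Python (no Spark, only suitable for data that fits in memory) with itertools.chain.from_iterable:

```python
from itertools import chain

my_tuples = [[('a', 'b'), ('a', 'c')],
             [('b', 'a'), ('b', 'f'), ('b', 'g')],
             [('zzsx', 'c'), ('zzsx', 'q'), ('zzsx', 'm'),
              ('zzsx', 'ay'), ('zzsx', 'bbt')]]

# Local equivalent of rdd.flatMap(lambda l: l).collect():
# chain the inner lists end to end into a single flat list.
my_list = list(chain.from_iterable(my_tuples))
print(my_list)
```

On an actual RDD, the flattening stays distributed until you call an action such as collect(), which is what brings the flat list back to the driver.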

Upvotes: 4
