pniessen

Reputation: 13

pyspark - flatten nested rdd using lambda expression with map()

To flatten nested lists I've always successfully used either a list comprehension or itertools.chain.from_iterable.

Using pyspark, however, I need to flatten a list of lists (of tuples) by mapping a lambda function, and for some reason I can't 'convert' this working list comprehension:

z = [[(1,2),(2,3),(3,4)],[(5,6),(7,8),(9,10)]]

[(j,k) for sublist in z for j,k in sublist]
[(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)] # right answer

into the equivalent map / lambda form:

list(map(lambda z: [(j,k) for sublist in z for j,k in sublist],z))
TypeError: 'int' object is not iterable

This is driving me crazy! What am I doing wrong?

Upvotes: 0

Views: 668

Answers (2)

Mike Müller

Reputation: 85462

Ugly, but it formally fulfills the requirement of using lambda and map:

>>> res = []
>>> list(map(lambda z: res.extend(z), z))
[None, None]
>>> res
[(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]
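
As an aside, the TypeError in the question comes from mapping the whole comprehension over z: the lambda then receives each inner sublist, so its for sublist in z loop yields tuples, and for j,k in sublist then tries to unpack plain integers. A minimal sketch of the direct fix (the flatten name is just for illustration) applies the comprehension to z once instead of once per element:

# The lambda takes the whole nested list, not each sublist.
flatten = lambda nested: [pair for sublist in nested for pair in sublist]

flatten(z)
# [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]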

Upvotes: 1

Moinuddin Quadri

Reputation: 48077

I suggest you use itertools.chain:

from itertools import chain

list(chain.from_iterable(z))

Alternatively, you can use sum():

sum(z, [])
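
Note that sum(z, []) builds a brand-new list on every addition, so it is quadratic in the total number of tuples; that is fine at this size, but worth knowing for large inputs.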

However, if you must use a lambda expression, it can be combined with reduce:

from functools import reduce  # reduce lives in functools on Python 3

list(reduce(lambda x, y: x + y, z))

Each of the above expressions returns:

[(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]
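
Finally, since the title asks about a pyspark RDD: the RDD API has flatMap for exactly this. A minimal sketch, assuming a local Spark context (the names sc and "flatten" are illustrative):

from pyspark import SparkContext

sc = SparkContext("local", "flatten")  # assumption: a local Spark context
rdd = sc.parallelize(z)                # z is the nested list from the question

# flatMap applies the lambda to each sublist and concatenates the results,
# so no separate flattening step is needed.
rdd.flatMap(lambda sublist: sublist).collect()
# [(1, 2), (2, 3), (3, 4), (5, 6), (7, 8), (9, 10)]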

Upvotes: 2
