cpd1

Reputation: 789

Performing a map on a tuple in pyspark

I managed to get what I needed done using Spark's MLlib (handled differently than below, and not related), but I'm wondering if there is any other way of accomplishing what I want to do.

I have data like this...

[(0, ([7, 6, 1, 4, 5, 4, 4, 3, 7, 0], [2])), (8, ([7, 4, 8, 2, 2, 0, 2, 6, 4, 0], [7]))]

Where I joined two different lists after I used zipWithIndex on both.

I'd like to process the above to be...

[(0, 7 * 2), (0, 6 * 2), (0, 1 * 2) ... etc

Where the joined zip index value is the key and the value is the product of each element in the first list with the only element in the second list.

Would something like that be doable?

Upvotes: 1

Views: 3923

Answers (1)

akuiper

Reputation: 214957

You can use flatMap and, for each element, return a list of tuples:

rdd.flatMap(lambda x: [(x[0], i * x[1][1][0]) for i in x[1][0]]).collect()

# [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10), (0, 8), (0, 8), (0, 6), (0, 14), (0, 0), (8, 49), (8, 28), (8, 56), (8, 14), (8, 14), (8, 0), (8, 14), (8, 42), (8, 28), (8, 0)]

To make this clearer, write a normal method for the mapping:

def list_mul(t):
    # unpack the pair: the join key and the two joined lists
    k, (l1, l2) = t
    # multiply each element of the first list by the single value in the second
    return [(k, i * l2[0]) for i in l1]

rdd.flatMap(list_mul).collect()
# [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10), (0, 8), (0, 8), (0, 6), (0, 14), (0, 0), (8, 49), (8, 28), (8, 56), (8, 14), (8, 14), (8, 0), (8, 14), (8, 42), (8, 28), (8, 0)]
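Since list_mul is a plain Python function, you can sanity-check its logic locally on the sample data without a SparkContext: flatMap over an RDD behaves like mapping and then flattening one level over a list (this is a local illustration only, not Spark code):

```python
def list_mul(t):
    # unpack the pair: the join key and the two joined lists
    k, (l1, l2) = t
    # multiply each element of the first list by the single value in the second
    return [(k, i * l2[0]) for i in l1]

data = [(0, ([7, 6, 1, 4, 5, 4, 4, 3, 7, 0], [2])),
        (8, ([7, 4, 8, 2, 2, 0, 2, 6, 4, 0], [7]))]

# flatMap equivalent: map each element, then flatten one level
result = [pair for t in data for pair in list_mul(t)]
print(result)
# [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10), (0, 8), (0, 8), (0, 6), (0, 14), (0, 0), (8, 49), (8, 28), (8, 56), (8, 14), (8, 14), (8, 0), (8, 14), (8, 42), (8, 28), (8, 0)]
```

The same function can then be passed to rdd.flatMap(list_mul) unchanged.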

Upvotes: 3
