Reputation: 789
I managed to get done what I need using Spark's MLlib (handled differently than below / not related), but I'm wondering if there is any other way of accomplishing it.
I have data like this...
[(0, ([7, 6, 1, 4, 5, 4, 4, 3, 7, 0], [2])), (8, ([7, 4, 8, 2, 2, 0, 2, 6, 4, 0], [7]))]
This came from joining two different lists after calling zipWithIndex on both.
I'd like to process the above to be...
[(0, 7 * 2), (0, 6 * 2), (0, 1 * 2) ... etc
Here the joined zip-index value is the key, and each value is the product of one element of the first list with the single element of the second list.
Would something like that be doable?
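To make the target concrete, here is the full expansion I'm after, expressed in plain Python (the `data` name is just for illustration; it holds the joined RDD's contents shown above):

```python
# Joined zipWithIndex output, as a plain Python list for illustration
data = [(0, ([7, 6, 1, 4, 5, 4, 4, 3, 7, 0], [2])),
        (8, ([7, 4, 8, 2, 2, 0, 2, 6, 4, 0], [7]))]

# For each key, multiply every element of the first list by the
# single element of the second list
desired = [(k, i * l2[0]) for k, (l1, l2) in data for i in l1]
print(desired[:3])  # [(0, 14), (0, 12), (0, 2)]
```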
Upvotes: 1
Views: 3923
Reputation: 214957
You can use flatMap and, for each element, return a list of tuples:
rdd.flatMap(lambda x: [(x[0], i * x[1][1][0]) for i in x[1][0]]).collect()
# [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10), (0, 8), (0, 8), (0, 6), (0, 14), (0, 0), (8, 49), (8, 28), (8, 56), (8, 14), (8, 14), (8, 0), (8, 14), (8, 42), (8, 28), (8, 0)]
To make this clearer, write a named function for the mapping:
def list_mul(t):
    k, (l1, l2) = t
    return [(k, i * l2[0]) for i in l1]
rdd.flatMap(list_mul).collect()
# [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10), (0, 8), (0, 8), (0, 6), (0, 14), (0, 0), (8, 49), (8, 28), (8, 56), (8, 14), (8, 14), (8, 0), (8, 14), (8, 42), (8, 28), (8, 0)]
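Since the mapping function is pure Python, you can sanity-check the flatMap logic locally without a SparkContext by applying it to the raw list and flattening one level (a quick check, not how you'd run it on a cluster):

```python
def list_mul(t):
    k, (l1, l2) = t
    return [(k, i * l2[0]) for i in l1]

# Same sample data as in the question, as a plain Python list
data = [(0, ([7, 6, 1, 4, 5, 4, 4, 3, 7, 0], [2])),
        (8, ([7, 4, 8, 2, 2, 0, 2, 6, 4, 0], [7]))]

# flatMap = map each element, then flatten the resulting lists one level
result = [pair for t in data for pair in list_mul(t)]
print(result[:5])  # [(0, 14), (0, 12), (0, 2), (0, 8), (0, 10)]
```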
Upvotes: 3