Saeed Soltani

Reputation: 43

Reduce operation on Spark

I am trying to reduce an RDD whose values are 3-tuples, so first I map the RDD into the following format:

a = mytable.rdd.map(lambda w: (w.id,(w.v1,w.v2,w.v3)))

and then in the next step I reduce it with the following code:

b = a.reduceByKey(lambda a,b,c: (a[0] +','+ a[1],b[0] +','+ b[1],c[0] +','+ c[1]))

However, I get the following error: TypeError: <lambda>() takes exactly 3 arguments (2 given)

My goal is to concatenate all the values of that RDD per key, so for example if my RDD has these values:

[('id1', ('a','b','c')), ('id1', ('e','f','g'))]

after the reduce I want the result to look like this:

[('id1', ('a,e','b,f','c,g'))]

Thanks

Upvotes: 0

Views: 468

Answers (1)

zero323

Reputation: 330413

An optimal solution can be expressed as:

a.groupByKey().mapValues(lambda vs: [",".join(v) for v in zip(*vs)])
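A minimal end-to-end sketch of this approach, assuming a local PySpark session (mytable is a hypothetical stand-in for the DataFrame in the question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Hypothetical stand-in for mytable from the question.
mytable = spark.createDataFrame(
    [("id1", "a", "b", "c"), ("id1", "e", "f", "g")],
    ["id", "v1", "v2", "v3"],
)

a = mytable.rdd.map(lambda w: (w.id, (w.v1, w.v2, w.v3)))
b = a.groupByKey().mapValues(lambda vs: [",".join(v) for v in zip(*vs)])

print(b.collect())
# [('id1', ['a,e', 'b,f', 'c,g'])]

If you need a tuple rather than a list to match the desired output exactly, wrap the comprehension in tuple(...).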

Here, the initial groupByKey groups the data into a structure equivalent to:

('id1', [('a','b','c'), ('e','f','g')])
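You can inspect this intermediate structure directly; groupByKey yields an iterable per key, so materialize it with list (reusing a from the sketch above):

a.groupByKey().mapValues(list).collect()
# [('id1', [('a', 'b', 'c'), ('e', 'f', 'g')])]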

zip(*vs) transposes values to:

[('a', 'e'), ('b', 'f'), ('c', 'g')]

and the comprehension with join concatenates each tuple into a comma-separated string.
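The same transposition can be reproduced in plain Python, independent of Spark:

vs = [('a', 'b', 'c'), ('e', 'f', 'g')]

# zip(*vs) transposes rows into per-position tuples ...
list(zip(*vs))                   # [('a', 'e'), ('b', 'f'), ('c', 'g')]

# ... and join turns each tuple into one comma-separated string.
[",".join(v) for v in zip(*vs)]  # ['a,e', 'b,f', 'c,g']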

reduceByKey is really not the right choice here (think about the complexity of repeated string concatenation), but in general it takes a function of two arguments, so lambda a, b, c: ... won't do. I believe you wanted something like this:

lambda a, b: (a[0] + "," + b[0], a[1] + "," + b[1], a[2] + "," + b[2])
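For completeness, the corrected two-argument version in action, reusing the RDD a from the sketch above:

b = a.reduceByKey(lambda a, b: (a[0] + "," + b[0], a[1] + "," + b[1], a[2] + "," + b[2]))
print(b.collect())
# [('id1', ('a,e', 'b,f', 'c,g'))]

Each merge builds new intermediate strings, which is the complexity concern noted above; the groupByKey variant joins each position once per key instead.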

Upvotes: 2
