Reputation: 43
I am trying to reduce an RDD with 3 values, so, at first, i map the rdd with the following format
a = mytable.rdd.map(lambda w: (w.id,(w.v1,w.v2,w.v3)))
and then in the next step i reduce it with the following code
b = a.reduceByKey(lambda a,b,c: (a[0] +','+ a[1],b[0] +','+ b[1],c[0] +','+ c[1]))
However, i get an error which is: TypeError: () takes exactly 3 arguments (2 given)
My goal is to add all the values of that rdd, so for example if my rdd having these values:
[(id1, ('a','b','c')),(id1', ('e','f','g'))]
After reduce i want the results to be in this order:
[(id1, ('a,d','b,e','c,f'))]
Thanks
Upvotes: 0
Views: 468
Reputation: 330413
An optimal solution can be expressed as:
a.groupByKey().mapValues(lambda vs: [",".join(v) for v in zip(*vs)])
where initial groupByKey
groups data into a structure equivalent to:
('id1', [('a','b','c'), ('e','f','g')])
zip(*vs)
transposes values to:
[('a', 'e'), ('b', 'f'), ('c', 'g')]
and comprehension with join
concatenates each tuple.
reduceByKey
is really not the right choice (think about complexity) here but in general it takes a function of two arguments so lambda a, b, c: ...
is wont' do. I believe you wanted something like this:
lambda a, b: (a[0] + "," + b[0], a[1] + "," + b[1], a[2] + "," + b[2])
Upvotes: 2