Reputation: 694
I have the following tuple.
#                                 x          y          z
[(('a', 'nexus4', 'stand'), ((-5.958191, 0.6880646, 8.135345), 1))]
#       part A (key)                  part B (value)          count
As you can see, the key (part A) is a tuple, the value (part B) is another tuple, and the trailing number is a count of the values seen for that key.
My code for doing this is the following one.
# Load the data
lectura = sc.textFile("asdasd.csv")
datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(float(x.split(",")[3]),float(x.split(",")[4]), float(x.split(",")[5]))))
meanRDD = (datos.mapValues(lambda x: (x, 1)))
Now I want to sum all the values that share the same key, so that I can compute the mean of the x, y and z columns.
I think I can do it by using reduceByKey, but I'm not applying this function correctly.
Example of my code that is not working:
sum = meanRDD.reduceByKey(lambda x, y: (x[0][0] + y[0][1],x[0][1] + y[1][1], x[0][2] + y[1][2]))
I know that afterwards I have to apply another mapValues to divide the sums by the count, but the sum itself isn't working correctly.
Example "asdasd.csv" file:
Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand
My key is the tuple (User, Model, gt), i.e. columns 6, 7 and 9, and my value is (x, y, z).
Any idea?
Upvotes: 0
Views: 1571
Reputation: 41957
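Below is a complete solution using reduceByKey. The reducer has to return a value with the same shape as its two inputs, ((x, y, z), count), because Spark feeds each result back into the reducer together with the next value for that key.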
lectura = sc.textFile("asdasd.csv")
datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(float(x.split(",")[3]),float(x.split(",")[4]), float(x.split(",")[5]))))
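# Indexing is used instead of tuple parameters in the lambdas, which Python 3 no longer accepts.
meanRDD = datos.mapValues(lambda v: (v, 1))\
    .reduceByKey(lambda a, b: ((a[0][0] + b[0][0], a[0][1] + b[0][1], a[0][2] + b[0][2]), a[1] + b[1]))\
    .mapValues(lambda t: (t[0][0] / float(t[1]), t[0][1] / float(t[1]), t[0][2] / float(t[1])))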
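To check the logic without a Spark cluster, the same sum-then-divide steps can be sketched in plain Python on the four sample rows from the question. The function names (parse_line, add_pairs, to_mean, mean_by_key) are hypothetical and chosen only for this illustration; mean_by_key is a local stand-in for what reduceByKey does per key, not Spark's implementation.

```python
from functools import reduce

# The sample rows from the question, including the header line.
CSV_TEXT = """Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand
"""


def parse_line(line):
    """Split the line once and return ((User, Model, gt), (x, y, z))."""
    f = line.split(",")
    return (f[6], f[7], f[9]), (float(f[3]), float(f[4]), float(f[5]))


def add_pairs(a, b):
    """Reducer: both inputs and the result have the shape ((x, y, z), count)."""
    (x1, y1, z1), n1 = a
    (x2, y2, z2), n2 = b
    return (x1 + x2, y1 + y2, z1 + z2), n1 + n2


def to_mean(total):
    """Divide each summed coordinate by the count."""
    (x, y, z), n = total
    return x / n, y / n, z / n


def mean_by_key(lines):
    """Local stand-in for mapValues + reduceByKey + mapValues."""
    grouped = {}
    for line in lines:
        key, xyz = parse_line(line)
        grouped.setdefault(key, []).append((xyz, 1))
    return {key: to_mean(reduce(add_pairs, vals)) for key, vals in grouped.items()}


rows = CSV_TEXT.strip().splitlines()[1:]  # drop the header row
means = mean_by_key(rows)
print(means)
```

If these helpers are adapted for Spark, add_pairs could be passed to reduceByKey and to_mean to the final mapValues. Note that the sample file starts with a header row, which float() cannot parse, so the header would need to be filtered out before the map step, for example with lectura.filter(lambda line: not line.startswith("Index")).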
Upvotes: 1