fiticida

Reputation: 694

Sum tuples values to calculate mean - RDD

I have the following tuple.

#                                 x           y        z
[(('a', 'nexus4', 'stand'), ((-5.958191, 0.6880646, 8.135345), 1))]
#           part A (key)               part B (value)         count

As you can see, the key (part A) is a tuple, the value (part B) is another tuple, and the final number is a count of how many values have been seen for that key.

My code for doing this is the following one.

# Load the data
lectura = sc.textFile("asdasd.csv")

datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(float(x.split(",")[3]),float(x.split(",")[4]), float(x.split(",")[5])))) 

meanRDD = (datos.mapValues(lambda x: (x, 1)))

Now I want to sum all the values that share the same key, so that I can compute the mean of the x, y, and z columns.

I think I can do it by using reduceByKey, but I'm not applying this function correctly.

Example of my code that is not working:

sum = meanRDD.reduceByKey(lambda x, y: (x[0][0] + y[0][1],x[0][1] + y[1][1], x[0][2] + y[1][2]))

I know that afterwards I have to apply another mapValues to divide the sums by the count, but the sum itself isn't working correctly.

Example "asdasd.csv" file:

Index,Arrival_Time,Creation_Time,x,y,z,User,Model,Device,gt
0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand
1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand
2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand
3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand

My key is the tuple (User, Model, gt), i.e. columns 6, 7 and 9, and my value is (x, y, z).

Any idea?

Upvotes: 0

Views: 1571

Answers (1)

Ramesh Maharjan

Reputation: 41957

Below is the complete solution using reduceByKey

lectura = sc.textFile("asdasd.csv")

datos = lectura.map(lambda x: ((x.split(",")[6], x.split(",")[7], x.split(",")[9]),(float(x.split(",")[3]),float(x.split(",")[4]), float(x.split(",")[5]))))

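# Each value has the shape ((x, y, z), count); the reducer must return that same shape.
# Tuples are indexed rather than unpacked in the lambda signature, which Python 3 does not allow.
meanRDD = datos.mapValues(lambda x: (x, 1))\
               .reduceByKey(lambda a, b: ((a[0][0]+b[0][0], a[0][1]+b[0][1], a[0][2]+b[0][2]), a[1]+b[1]))\
               .mapValues(lambda v: (v[0][0]/float(v[1]), v[0][1]/float(v[1]), v[0][2]/float(v[1])))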
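
To check the sum-then-divide logic without a Spark cluster, here is a minimal pure-Python sketch that runs the same steps on the four sample rows from the question. The helper names (parse, combine, to_mean) are made up for illustration, and the dictionary loop is only a local stand-in for reduceByKey, not Spark itself. The header row is left out of the sample on purpose: float("x") would raise a ValueError, so a real file with a header would presumably need that line filtered out before the map.

```python
# Sample rows copied from the question's asdasd.csv (header row omitted).
rows = [
    "0,1424696633908,1424696631913248572,-5.958191,0.6880646,8.135345,a,nexus4,nexus4_1,stand",
    "1,1424696633909,1424696631918283972,-5.95224,0.6702118,8.136536,a,nexus4,nexus4_1,stand",
    "2,1424696633918,1424696631923288855,-5.9950867,0.6535491999999999,8.204376,a,nexus4,nexus4_1,stand",
    "3,1424696633919,1424696631928385290,-5.9427185,0.6761626999999999,8.128204,a,nexus4,nexus4_1,stand",
]

def parse(line):
    """Return ((User, Model, gt), ((x, y, z), 1)) for one CSV line."""
    f = line.split(",")  # split once instead of once per field
    return (f[6], f[7], f[9]), ((float(f[3]), float(f[4]), float(f[5])), 1)

def combine(a, b):
    """Merge two ((sum_x, sum_y, sum_z), count) values into one of the same shape."""
    (x1, y1, z1), n1 = a
    (x2, y2, z2), n2 = b
    return (x1 + x2, y1 + y2, z1 + z2), n1 + n2

def to_mean(v):
    """Divide each summed component by the count."""
    (sx, sy, sz), n = v
    return (sx / float(n), sy / float(n), sz / float(n))

# Local stand-in for reduceByKey: fold all values that share a key with combine.
grouped = {}
for key, value in map(parse, rows):
    grouped[key] = combine(grouped[key], value) if key in grouped else value

means = {k: to_mean(v) for k, v in grouped.items()}
print(means)
```

The same two functions should also be usable in the Spark pipeline in place of the lambdas, e.g. datos.mapValues(lambda x: (x, 1)).reduceByKey(combine).mapValues(to_mean). The important point is that combine takes and returns the same ((x, y, z), count) shape, which is what the attempt in the question was missing.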

Upvotes: 1
