Reputation: 51
This question is regarding groupByKey() in Spark using Scala.
Consider the data below:
Name,marks,value
Chris,30,1
Chris,35,1
Robert,12,1
Robert,20,1
I created the RDD below:
val dataRDD = sc.parallelize(List(("Chris",30,1),("Chris",35,1),("Robert",12,1),("Robert",20,1)))
I am trying to create a key-value pair RDD from this, like:
val kvRDD = dataRDD.map(rec=> (rec._1, (rec._2,rec._3)))
Now I want the sum of both values for each key.
val sumRDD = kvRDD.groupByKey().map(rec => (rec._1,(rec._2._1.sum, rec._2._2.sum)))
However, I am getting the error below.
<console>:28: error: value _2 is not a member of Iterable[(Int, Int)]
Can't we achieve the required result using groupByKey?
Upvotes: 3
Views: 125
Reputation: 1642
It is recommended to use reduceByKey in such a scenario, but if you still want to do it with groupByKey you can try the approach below. I am doing it the Python way; you can do the same with Scala.
def summly(ilist):
    sum1 = 0
    sum2 = 0
    for i in ilist:
        sum1 = sum1 + i[0]
        sum2 = sum2 + i[1]
    return (sum1, sum2)
sumRDD = kvRDD.groupByKey().map(lambda x: (x[0], summly(list(x[1]))))
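For reference, a minimal Scala sketch of the same groupByKey idea, assuming the kvRDD defined in the question (my translation, not the original code):
// Sketch only: fold over the grouped Iterable[(Int, Int)] and sum each field
val sumRDD = kvRDD.groupByKey().map { case (name, values) =>
  val sums = values.foldLeft((0, 0)) { case ((s1, s2), (m, v)) => (s1 + m, s2 + v) }
  (name, sums)
}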
Upvotes: 1
Reputation: 1892
After groupByKey, the value for each key in kvRDD is a collection of tuples, so you can map out each field and sum it directly. You can do it like below:
val sumRDD = kvRDD.groupByKey.map(rec => (rec._1, (rec._2.map(_._1).sum, rec._2.map(_._2).sum)))
// Output
scala> sumRDD.collect
res11: Array[(String, (Int, Int))] = Array((Robert,(32,2)), (Chris,(65,2)))
Upvotes: 1
Reputation: 22439
Rather than groupByKey, I would suggest the more efficient reduceByKey, which combines values per key within each partition before shuffling:
val dataRDD = sc.parallelize(Seq(
("Chris",30,1), ("Chris",35,1), ("Robert",12,1), ("Robert",20,1)
))
val kvRDD = dataRDD.map(rec => (rec._1, (rec._2, rec._3)))
val sumRDD = kvRDD.reduceByKey{ (acc, t) =>
(acc._1 + t._1, acc._2 + t._2)
}
sumRDD.collect
// res1: Array[(String, (Int, Int))] = Array((Robert,(32,2)), (Chris,(65,2)))
Upvotes: 1