Reputation: 3416
Given that I have this Spark function:
val group = whereRdd.map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
.groupByKey
.map(grouped => grouped._1 -> grouped._2.toSet)
group.foreach(g => println(g))
I am getting:
(639461796080961,Set(15))
(214680441881239,Set(5, 10, 25, -99, 99, 19, 100))
(203328349712668,Set(5, 10, 15, -99, 99, 15, 10))
Is it possible to add a Map() to this function and include the avg and sum of each Set? For example:
(639461796080961,Map("data" -> Set(5, 10, 25, -99, 99, 19, 100), "avg" -> 22.71, "sum" -> 159))
Upvotes: 2
Views: 2863
Reputation: 330063
One thing I would recommend is to use a Tuple or a case class instead of a Map. I mean roughly something like this:
case class Location(id: Long, values: Set[Int], sum: Int, avg: Double)
val group = whereRdd
.map(collection =>
collection.getLong("location_id") -> collection.getInt("feel"))
.groupByKey
.map{case (id, values) => {
val set = values.toSet
val sum = set.sum
val mean = sum / set.size.toDouble
Location(id, set, sum, mean)
}}
The biggest advantage over a Map is that it keeps the types in order.
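To make that point concrete, here is a small sketch that is not part of the original answer; it reuses the Location class defined above. Each case-class field keeps its own precise type, whereas a heterogeneous Map collapses its values to a common supertype such as Any, so reading them back requires a cast:

// With the case class every field has a precise type:
val loc = Location(639461796080961L, Set(15), 15, 15.0)
val avg: Double = loc.avg                        // no cast needed

// With a Map the values collapse to Any,
// so reading them back requires an unsafe cast:
val m: Map[String, Any] = Map("data" -> Set(15), "sum" -> 15, "avg" -> 15.0)
val avgFromMap = m("avg").asInstanceOf[Double]   // cast needed, can fail at runtime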
Upvotes: 3
Reputation: 3416
After reading @zero323's answer I added a Map() and it works:
val group = whereRdd.map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
.groupByKey
.map(grouped => grouped._1 -> Map(
"data" -> grouped._2.toSet,
"sum" -> grouped._2.toSet.sum,
"amount" -> grouped._2.toSet.size,
"avg" -> grouped._2.toSet.sum.asInstanceOf[Int] / grouped._2.toSet.size
))
group.foreach(g => println(g))
And I am getting:
(193809797319052,Map(data -> Set(5, 10, 25, -99, 99, 15, 100), sum -> 155, amount -> 7, avg -> 22))
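Not part of the original post, but the same result can be produced while building the Set only once and keeping the average as a Double instead of truncating it with integer division; a minimal sketch, assuming the same whereRdd:

val group = whereRdd
  .map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map { case (id, values) =>
    val set = values.toSet                        // materialize the Set once instead of four times
    id -> Map(
      "data"   -> set,
      "sum"    -> set.sum,
      "amount" -> set.size,
      "avg"    -> set.sum / set.size.toDouble     // Double division keeps the fractional part
    )
  }
group.foreach(println)

With the sample values above this would print avg -> 22.142857142857142 rather than 22.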
Upvotes: 2