ssuperczynski

Reputation: 3416

How do I get Avg and Sum in Spark RDD

Given the following Spark code:

val group = whereRdd.map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map(grouped => grouped._1 -> grouped._2.toSet)

group.foreach(g => println(g))

I am getting:

(639461796080961,Set(15))
(214680441881239,Set(5, 10, 25, -99, 99, 19, 100))
(203328349712668,Set(5, 10, 15, -99, 99, 15, 10))

Is it possible to add a Map() to this pipeline and include the avg and sum of each Set? For example:

(639461796080961,Map("data" -> Set(5, 10, 25, -99, 99, 19, 100), "avg" -> 22.71, "sum" -> 159))

Upvotes: 2

Views: 2863

Answers (2)

zero323

Reputation: 330063

One thing I would recommend is to use a tuple or a case class instead of a Map. I mean roughly something like this:

case class Location(id: Long, values: Set[Int], sum: Int, avg: Double)

val group = whereRdd
  .map(collection =>
    collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map { case (id, values) =>
    val set = values.toSet
    val sum = set.sum
    val mean = sum / set.size.toDouble  // divide as Double to avoid integer division
    Location(id, set, sum, mean)
  }
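
Running group.foreach(println) on the sample data from the question would then print one Location per key, for example Location(639461796080961,Set(15),15,15.0) for the first key.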

The biggest advantage over a Map is that it keeps the types in order.
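
To illustrate (a minimal sketch; the Map-based variant is shown only for comparison): fields of the case class keep their declared types, whereas a heterogeneous Map is typed Map[String, Any] and forces a cast on every read:

val loc = Location(639461796080961L, Set(15), 15, 15.0)
val mean: Double = loc.avg  // statically typed, no cast needed

val asMap: Map[String, Any] = Map("data" -> Set(15), "sum" -> 15, "avg" -> 15.0)
val mean2 = asMap("avg").asInstanceOf[Double]  // compiles for any key, fails only at runtime if the type is wrong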

Upvotes: 3

ssuperczynski

Reputation: 3416

After reading @zero323's answer I added a Map() and it works:

val group = whereRdd
  .map(collection => collection.getLong("location_id") -> collection.getInt("feel"))
  .groupByKey
  .map { case (id, values) =>
    val set = values.toSet  // build the set once instead of recomputing it for every entry
    id -> Map(
      "data" -> set,
      "sum" -> set.sum,
      "amount" -> set.size,
      "avg" -> set.sum / set.size  // integer division; see the note below
    )
  }

group.foreach(g => println(g))

And I am getting:

(193809797319052,Map(data -> Set(5, 10, 25, -99, 99, 15, 100), sum -> 155, amount -> 7, avg -> 22))
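
Note that set.sum / set.size is integer division, which is why avg comes out as 22 rather than 22.14 above. Converting one operand to Double gives the fractional mean (a quick check against the printed set):

val set = Set(5, 10, 25, -99, 99, 15, 100)
val intAvg = set.sum / set.size            // 22 (integer division)
val realAvg = set.sum.toDouble / set.size  // 22.142857142857142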

Upvotes: 2
