Reputation: 830
I am aggregating values by key as below using Apache Spark and Scala. This keeps appending values to a List. Is there a more efficient way to get a StatCounter per key instead of a list?
import org.apache.spark.util.StatCounter

val predictorRawKey = predictorRaw.map { x =>
  val param = x._1
  val value: Double = x._2.toDouble  // `val` is a reserved word in Scala, so the variable is renamed
  (param, value)
}.mapValues(num => List(num))
  .reduceByKey((l1, l2) => l1 ::: l2)
  .map { case (k, v) => (k, StatCounter(v.iterator)) }
Upvotes: 1
Views: 397
Reputation: 330193
For starters you shouldn't use reduceByKey to group values. It is more efficient to omit map-side aggregation and use groupByKey directly.
Fortunately StatCounter can work in a streaming fashion, so there is no need to group values at all:
import org.apache.spark.util.StatCounter
val pairs = predictorRawKey.map(x => (x._1, x._2.toDouble))

// Don't reuse the name predictorRawKey for the result; it would shadow the input RDD
val stats = pairs.aggregateByKey(StatCounter(Nil))(
  (acc: StatCounter, x: Double) => acc.merge(x),
  (acc1: StatCounter, acc2: StatCounter) => acc1.merge(acc2)
)
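To see why no intermediate List is needed, the same merge-based pattern can be sketched in plain Scala without Spark. The MiniStat class below is a simplified, hypothetical stand-in for StatCounter (tracking only count, sum, and mean), not part of any library; the foldLeft plays the role of aggregateByKey's per-key seqOp:

```scala
// Simplified stand-in for Spark's StatCounter: tracks count and sum,
// and supports merging both single values and other accumulators.
final case class MiniStat(count: Long = 0L, sum: Double = 0.0) {
  def mean: Double = if (count == 0) Double.NaN else sum / count
  def merge(x: Double): MiniStat = MiniStat(count + 1, sum + x)
  def merge(other: MiniStat): MiniStat =
    MiniStat(count + other.count, sum + other.sum)
}

val pairs = Seq(("a", 1.0), ("b", 2.0), ("a", 3.0), ("b", 4.0))

// Equivalent of aggregateByKey: fold each value into a per-key accumulator,
// never materializing a per-key List of values.
val stats = pairs.foldLeft(Map.empty[String, MiniStat]) {
  case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, MiniStat()).merge(v))
}

println(stats("a").mean)  // 2.0
println(stats("b").mean)  // 3.0
```

Because each value is merged into a constant-size accumulator as it arrives, memory per key stays O(1) instead of growing with the number of values, which is exactly what makes the aggregateByKey version cheaper than building lists.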
Upvotes: 1