Darshan

Reputation: 81

Count operation in reduceByKey in Spark

val temp1 = tempTransform.map { temp =>
    ((temp.getShort(0), temp.getString(1)), (temp.getDouble(2), temp.getDouble(3)))
  }
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

Here I have performed a sum operation, but is it possible to do a count operation inside reduceByKey?

Something like this is what I had in mind:

reduceByKey((x, y) => (math.count(x._1),(x._2+y._2)))

But this is not working. Any suggestions, please?

Upvotes: 0

Views: 5125

Answers (2)

Tzach Zohar

Reputation: 37852

Well, counting is equivalent to summing 1s, so just map the first item in each value tuple into 1 and sum both parts of the tuple like you did before:

val temp1 = tempTransform.map { temp =>
    ((temp.getShort(0), temp.getString(1)), (1, temp.getDouble(3)))
  }
  .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))

The result is an RDD[((Short, String), (Int, Double))], where the first item of the value tuple (the Int) is the number of original records matching that key.

That's actually the classic map-reduce example - word count.
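To see the idea without a Spark cluster, here is a minimal plain-Scala sketch of the same count-via-sum-of-1s aggregation over an in-memory collection (the `countAndSum` name and the sample keys are illustrative; `groupBy` plus `reduce` plays the role of `reduceByKey`):

```scala
object CountViaSum {
  type Key = (Short, String)

  // Map each record's value to (1, usage), then reduce per key,
  // just as reduceByKey would: sum the 1s (the count) and the usages.
  def countAndSum(records: Seq[(Key, Double)]): Map[Key, (Int, Double)] =
    records
      .map { case (key, usage) => (key, (1, usage)) }
      .groupBy(_._1)
      .map { case (key, vs) =>
        key -> vs.map(_._2).reduce((x, y) => (x._1 + y._1, x._2 + y._2))
      }

  def main(args: Array[String]): Unit = {
    val result = countAndSum(Seq(
      ((1.toShort, "a"), 10.0),
      ((1.toShort, "a"), 20.0),
      ((2.toShort, "b"), 5.0)
    ))
    // Key (1, "a") maps to (2, 30.0): two records, usages summing to 30.0.
    println(result)
  }
}
```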

Upvotes: 1

lege

Reputation: 158

No, you can't do that. RDDs provide an iterator model for lazy computation, so every element is visited only once.

If you really want to do the sum as described, repartition your RDD first, then use mapPartitions and implement your calculation in the closure (keep in mind that the elements of an RDD are not in any particular order).
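A minimal plain-Scala sketch of that per-partition idea, with partitions simulated as nested sequences (Spark's mapPartitions would supply real iterators; the `partialPerPartition` name is illustrative):

```scala
object PartitionAgg {
  // Compute one (count, sum) pair per simulated partition, then merge
  // the partials, mirroring what mapPartitions followed by reduce does.
  def partialPerPartition(partitions: Seq[Seq[Double]]): (Int, Double) =
    partitions
      .map(p => (p.size, p.sum))                     // one partial per partition
      .reduce((x, y) => (x._1 + y._1, x._2 + y._2))  // merge the partials

  def main(args: Array[String]): Unit = {
    // Two partitions: three elements in total, summing to 6.0.
    println(partialPerPartition(Seq(Seq(1.0, 2.0), Seq(3.0))))
  }
}
```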

Upvotes: 0
