JarsOfJam-Scheduler

Reputation: 3149

Using Apache Spark, how to count the occurrences of each pair in a Scala Array

I need to count the occurrences of each pair present in a Scala Array, and the latter is distributed. So:

  1. I must count the occurrences of each pair present in each node's part of the distributed Array (i.e. in each partition of the RDD). This yields x partial results^1, where x is the number of my cluster's nodes.

  2. Then, the driver must sum up those partial results, to obtain the number of occurrences of each pair across the whole distributed Array.

^1 : please note that one result is one node's count of each pair within its own part of the distributed Array. I think a HashMap would be fine to use there. The driver will use a HashMap too: it will have to sum each entry of its HashMap with the corresponding entries of the HashMaps it receives from the cluster's nodes. A sketch of this two-step approach is shown below.
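For reference, here is a minimal sketch of that two-step approach, assuming a local SparkContext; `mapPartitions` builds one HashMap per partition (step 1) and the driver merges them (step 2). The object and variable names are illustrative, not from the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import scala.collection.mutable

    object PairCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("pair-count").setMaster("local[*]"))
        val pairs = sc.parallelize(Seq(("BLUE", "RED"), ("RED", "GREEN"), ("BLUE", "RED")))

        // Step 1: each partition counts its own pairs into a local HashMap
        val partial = pairs.mapPartitions { iter =>
          val counts = mutable.HashMap.empty[(String, String), Int]
          iter.foreach(p => counts(p) = counts.getOrElse(p, 0) + 1)
          Iterator(counts)
        }

        // Step 2: the driver collects the per-partition maps and sums matching entries
        val total = mutable.HashMap.empty[(String, String), Int]
        partial.collect().foreach(_.foreach { case (k, v) =>
          total(k) = total.getOrElse(k, 0) + v
        })

        total.foreach(println)   // e.g. ((BLUE,RED),2), ((RED,GREEN),1)
        sc.stop()
      }
    }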


Upvotes: 1

Views: 685

Answers (1)

Danny Mor

Reputation: 1263

It seems like you need reduceByKeyLocally:

    // `context` is assumed to be an existing SparkContext
    val result: collection.Map[(String, String), Int] = context
      .parallelize(Seq(("BLUE", "RED"), ("RED", "GREEN"), ("YELLOW", "ORANGE")))
      .map(colorPair => (colorPair, 1))   // pair each element with a count of 1
      .reduceByKeyLocally(_ + _)          // sum the counts per pair, merging partition results on the driver
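With this sample input every pair occurs exactly once, so `result` maps each pair to 1; repeated input pairs would have their counts summed into a single entry.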

reduceByKeyLocally maps locally first, merges within each partition, and then combines the per-partition results on the driver, returning them immediately as a Map (it is an action, not a transformation).
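If the merged map might be too large for the driver's memory, a distributed alternative (not part of the original answer) is reduceByKey, which keeps the counts as an RDD, collecting them only if they fit:

    // Alternative sketch: keep the counts distributed across the cluster
    val counts = context
      .parallelize(Seq(("BLUE", "RED"), ("RED", "GREEN"), ("BLUE", "RED")))
      .map(pair => (pair, 1))
      .reduceByKey(_ + _)        // shuffles and sums the counts per pair across the cluster
    // counts.collectAsMap()     // optionally bring the final counts back to the driver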

Upvotes: 1
