Reputation: 3149
I need to count the occurrences of each pair present in a Scala Array, and the latter is distributed. So:
I must count the occurrences of each pair present in the parts of the RDD held by my cluster's nodes (i.e. "on each part of the distributed Array"). That gives me x results^1, where x is the number of nodes in my cluster.
Then the driver must add up these results to get the number of occurrences of each pair across the whole distributed Array.
^1: note that one result is a single node's count of each pair within its own part of the distributed Array. I think a HashMap would be fine to use there. The driver will use a HashMap too: it has to add each entry of its own HashMap to the corresponding entry of the HashMaps it receives from the cluster nodes.
Upvotes: 1
Views: 685
Reputation: 1263
It seems like you need reduceByKeyLocally:
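// `context` is assumed to be an existing SparkContext
val result: collection.Map[(String, String), Int] = context
  .parallelize(Seq(("BLUE", "RED"), ("RED", "GREEN"), ("YELLOW", "ORANGE")))
  .map(colorPair => (colorPair, 1)) // pair each color pair with an initial count of 1
  .reduceByKeyLocally(_ + _)        // sum the counts per key and return a Map to the driver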
reduceByKeyLocally first merges the values for each key locally, within each partition, then runs a reduce action to merge those partial results and returns the final Map directly to the driver.
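To make the "count locally, then merge on the driver" idea concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. It only illustrates the two steps described in the question and is not Spark's actual implementation. The names PairCountSketch, countPartition, mergeCounts and the sample data are hypothetical, and the two inner sequences stand in for the parts held by two cluster nodes.

```scala
object PairCountSketch {
  type Pair = (String, String)

  // Count the pairs of one part into a local map (what each node would do).
  def countPartition(partition: Seq[Pair]): Map[Pair, Int] =
    partition.foldLeft(Map.empty[Pair, Int]) { (counts, pair) =>
      counts.updated(pair, counts.getOrElse(pair, 0) + 1)
    }

  // Merge two partial maps by summing the counts of matching keys
  // (what the driver would do with the maps it receives).
  def mergeCounts(a: Map[Pair, Int], b: Map[Pair, Int]): Map[Pair, Int] =
    b.foldLeft(a) { case (acc, (pair, n)) =>
      acc.updated(pair, acc.getOrElse(pair, 0) + n)
    }

  // Hypothetical data: two "partitions" standing in for two cluster nodes.
  val partitions: Seq[Seq[Pair]] = Seq(
    Seq(("BLUE", "RED"), ("RED", "GREEN"), ("BLUE", "RED")),
    Seq(("RED", "GREEN"), ("YELLOW", "ORANGE"))
  )

  // One local map per partition, then a pairwise merge into the total.
  val total: Map[Pair, Int] =
    partitions.map(countPartition).reduce(mergeCounts)
}
```

With this sample data, PairCountSketch.total holds a count of 2 for ("BLUE", "RED"), 2 for ("RED", "GREEN") and 1 for ("YELLOW", "ORANGE"). In Spark, reduceByKeyLocally spares you from writing either step by hand.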
Upvotes: 1