CoMacNo
CoMacNo

Reputation: 41

Counting number of occurrences of Array element in a RDD

I have a RDD1 with Key-Value pair of type [(String, Array[String])] (i will refer to them as (X, Y)), and a Array Z[String]. I'm trying for every element in Z to count how many X instances there are that have Z in Y. I want my output as ((X, Z(i)), #ofinstances).

RDD1= ((A, (2, 3, 4), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)

then i want to get:

(((A, 4), 2), ((B, 4), 1))

Hope that made sense. As you can see over, i only want an element if there is atleast one occurence.

I have tried this so far:

val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}

My output here is an RDD[Unit]

Im not sure if what i'm asking for is even possible, or if i have to do it an other way.

Upvotes: 0

Views: 2493

Answers (1)

Alper t. Turker
Alper t. Turker

Reputation: 35229

So it is just another word count

val rdd = sc.parallelize(Seq(
   ("A", Array("2", "3", "4")), 
   ("B", Array("4", "4", "4")),
   ("A", Array("4", "5"))))

val z = Array("1", "4")

To make lookups efficient convert z to Set:

val zs = z.toSet

val result = rdd
  .flatMapValues(_.filter(zs contains _).distinct)
  .map((_, 1))
  .reduceByKey(_ + _)

where

_.filter(zs contains _).distinct

filters out values that occur in z and deduplicates.

result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)

Upvotes: 1

Related Questions