xiaodelaoshi
xiaodelaoshi

Reputation: 631

How to count occurrences of a word inside a Array in scala when using spark?

In spark, I have a RDD which is :

textRDD[Text]

Here "Text" is a class with:

(String, Array[String])

I wish to know in this RDD how many strings in the array are the same as my target string, so is there any way to count the number of that?

The first thing I have is something like:

val count = textRDD.aggregate(0)(seqOp, (acc1, acc2) => acc1 + acc2))

But I have no idea about how to design seqOp in this case.

For example:

textRDD = sc.parallelize(
     List(
         ("s1", Array("this", "is", "a", "sentence", "about", math)),
         ("s2", Array("math", "is", "an", "english", "word", "in" "english")),
         ("s3", Array("computer", "science", "is", "a", "science", "with", "math"))
         )

How can I count the num of "math"?

Upvotes: 1

Views: 2606

Answers (1)

Harald Gliebe
Harald Gliebe

Reputation: 7564

You can count the occurrences of a String s in an Array[String] a with the count method:

a.count(_ == s)

For the seqOp in your case above, you would need to add the accumulator:

val count = textRDD.aggregate(0)((acc, pair) => acc + pair._2.count(_ == pair._1), (acc1, acc2) => acc1 + acc2))

I'm not sure if that's what you want exactly, this aggregate would count the sum of all the occurrences of the keys in the arrays, not separated by key.

If you want the occurrences per key, you could use aggregateByKey:

textRDD.map(x => (x._1, (x._1,x._2)))
  .aggregateByKey(0)((acc, pair) => acc+ pair._2.count(_ == pair._1),(a1, a2) => a1 + a2)

Upvotes: 3

Related Questions