Siva

Reputation: 1859

Count occurrences of each word in Apache Spark

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  val sc = new SparkContext("local[4]", "wc")

  val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
  val errors = lines.filter(line => line.contains("ERROR"))

  // Count all the errors
  println(errors.count())

The above snippet counts the number of lines that contain the word ERROR. Is there a simple function, similar to "contains", that would return the number of occurrences of the word instead?

Say the file is several gigabytes in size, and I would want to parallelize the effort using a Spark cluster.

Upvotes: 2

Views: 2291

Answers (1)

maasg

Reputation: 37435

Just count the instances per line and sum those together:

val errorCount = lines.map{line => line.split("[\\p{Punct}\\p{Space}]").filter(_ == "ERROR").size}.reduce(_ + _)
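
For the question in the title (counting occurrences of each word, not just ERROR), a minimal sketch along the same lines, assuming the same `lines` RDD and the same punctuation/whitespace splitting, could use `flatMap` and `reduceByKey`:

  // Sketch: split every line into words, drop empty tokens,
  // and sum a count of 1 per word across the cluster.
  val wordCounts = lines
    .flatMap(line => line.split("[\\p{Punct}\\p{Space}]+"))
    .filter(_.nonEmpty)
    .map(word => (word, 1))
    .reduceByKey(_ + _)

  // Look up a single word's total, e.g. ERROR
  val errorOccurrences = wordCounts.lookup("ERROR").headOption.getOrElse(0)

`reduceByKey` combines partial counts within each partition before shuffling, so the aggregation is distributed across the cluster rather than pulled back to the driver.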

Upvotes: 1
