Reputation: 1859
val sc = new SparkContext("local[4]", "wc")
val lines: RDD[String] = sc.textFile("/tmp/inputs/*")
val errors = lines.filter(line => line.contains("ERROR"))
// Count all the errors
println(errors.count())
The above snippet counts the number of lines that contain the word ERROR. Is there a simple function, similar to `contains`, that would return the number of occurrences of the word instead?
Say the file is several gigabytes in size, and I want to parallelize the effort using a Spark cluster.
Upvotes: 2
Views: 2291
Reputation: 37435
Just count the instances per line and sum those together:
val errorCount = lines
  .map(line => line.split("[\\p{Punct}\\p{Space}]").count(_ == "ERROR"))
  .reduce(_ + _)
Upvotes: 1
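For reference, here is a minimal sketch of the same per-line counting logic on a plain Scala collection (no Spark), assuming the same punctuation/whitespace tokenization as the answer above; the `sample` data is made up for illustration. Spark's `map`/`reduce` applies exactly this shape across partitions in parallel.

```scala
// Hypothetical sample lines standing in for the log file.
val sample = Seq(
  "ERROR: disk full, ERROR repeated",
  "INFO: all good",
  "ERROR again"
)

// Split each line on punctuation or whitespace, count exact "ERROR"
// tokens, then sum the per-line counts.
val total = sample
  .map(line => line.split("[\\p{Punct}\\p{Space}]").count(_ == "ERROR"))
  .sum

println(total) // 3
```

Note that splitting on the token boundary and comparing with `== "ERROR"` counts whole words only; a substring match such as `line.sliding("ERROR".length).count(_ == "ERROR")` would also count occurrences embedded in longer words.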