DilTeam

Reputation: 2661

Apache Spark: Manual counting gives different results from count() function

Can someone tell me why I don't see the exact same counts printed in these two cases:

val count = myRdd.count()
println("Count from count() function: " + count)

var counters: Map[String, Int] = Map()
myRdd.foreach(i => {
  counters = counters.updated(i, counters.getOrElse(i, 0) + 1)
})

counters.foreach(i => {
  println("My key: " + i._1 + " Count: " + i._2)
})

Just trying to understand the 'map.updated' function.
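For context, `Map.updated` on its own behaves as expected: on an immutable `Map` it returns a new map with the given key set, leaving the original untouched. A minimal local sketch (plain Scala, no Spark involved):

```scala
object UpdatedDemo extends App {
  val m1: Map[String, Int] = Map("a" -> 1)

  // updated returns a NEW map; m1 is not modified
  val m2 = m1.updated("a", m1.getOrElse("a", 0) + 1) // bump an existing key
  val m3 = m2.updated("b", m2.getOrElse("b", 0) + 1) // add a missing key

  assert(m1 == Map("a" -> 1))             // original map unchanged
  assert(m3 == Map("a" -> 2, "b" -> 1))
  println(m3)
}
```

So the counting logic itself is fine on a single JVM; the discrepancy comes from where that logic runs, as the answer below explains.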

Also, if I change 'counters' to:

var counters: Map[String, Long] = Map()

I get compiler error:

Error:(58, 65) type mismatch;
 found   : Long(1L)
 required: String
      counters = counters.updated(i, counters.getOrElse(i, 0) + 1L)

Why do I get this compiler error? Why is it saying that 'String' is required?

Upvotes: 0

Views: 130

Answers (1)

il.bert

Reputation: 232

You cannot have a mutable variable (var counters) shared across workers: each task mutates its own serialized copy of the closure, so the updates never make it back to the driver, and you would end up with a concurrency issue anyway.

I suggest referring to the basic word count exercise as a starting point.

Something like this

myRdd.map(word => (word, 1)).reduceByKey(_ + _)

This will give the per-key counts your map was trying to compute. Is this your expected result?
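If you want to sanity-check the shape of that result without a cluster, the same map/reduceByKey pattern can be reproduced on a plain Scala collection. Here `groupMapReduce` (Scala 2.13+) stands in for `reduceByKey`; the sample words are made up for illustration:

```scala
object WordCountDemo extends App {
  val words = Seq("a", "b", "a", "c", "a", "b")

  // Local equivalent of myRdd.map(w => (w, 1)).reduceByKey(_ + _):
  // group by the word, map each occurrence to 1, sum within each key.
  val counts: Map[String, Int] = words.groupMapReduce(identity)(_ => 1)(_ + _)

  assert(counts == Map("a" -> 3, "b" -> 2, "c" -> 1))
  // Summing the per-key counts recovers the total, matching count().
  assert(counts.values.sum == words.size)
  println(counts)
}
```

In Spark the per-key aggregation happens on the executors and is shuffled back correctly, which is why this works where mutating a driver-side var does not.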

Upvotes: 2
