Gurupraveen
Gurupraveen

Reputation: 181

Spark : Difference between accumulator and local variable

While exploring Spark accumulators, I tried to understand and showcase the difference between the accumulator and regular variable in Spark. But output does not seem to match my expectation. I mean both the accumulator and counter have the same value at the end of program and am able read accumulator within transformation function (as per docs only driver can read accumulator). Am i doing something wrong? Is my understanding correct?

object Accmulators extends App {

  val spark = SparkSession.builder().appName("Accmulators").master("local[3]").getOrCreate()

  val cntAccum = spark.sparkContext.longAccumulator("counterAccum")

  val file = spark.sparkContext.textFile("resources/InputWithBlank")

  var counter = 0

  def countBlank(line:String):Array[String]={
    val trimmed = line.trim
    if(trimmed == "") {
      cntAccum.add(1)
      cntAccum.value //reading accumulator
      counter += 1
    }
    return line.split(" ")
  }

  file.flatMap(line => countBlank(line)).collect()

  println(cntAccum.value)

  println(counter)
}

The input file has text with 9 empty lines in between

4) Aggregations and Joins

5) Spark SQL

6) Spark App tuning

Output :

Both counter and cntAccum giving same result.

Upvotes: 1

Views: 775

Answers (1)

Ram Ghadiyaram
Ram Ghadiyaram

Reputation: 29185

counter is local variable may be is working in your local program .master("local[3]") which will execute on driver. imagine you are running yarn mode. then all the logic will be working in a distributed way your local variable wont be updated (since its local its getting updated) but accumulator will be updated. since its distributed variable. suppose you have 2 executors running the program... one executor will update and another executor can able to see the latest value. In this case your cntAccum is capable of getting latest value from other executors in yarn distributed mode. where as local variable counter cant...

since accumulators are read and write. see docs here.

enter image description here

In the image exeutor id is localhost. if you are using yarn with 2-3 executors it will show executor ids. Hope that helps..

Upvotes: 0

Related Questions