Reputation: 352
Why does the accumulator variable in the following code not print the aggregated string?
import org.apache.spark.sql.SparkSession

object mapRDD {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("sparkSessionName")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val data = Seq("Project",
      "Gutenberg’s",
      "Alice’s",
      "Adventures",
      "in",
      "Wonderland")
    val rdd = spark.sparkContext.parallelize(data)

    var accumulator: String = "WHY IS THE AGGREGATED STRING NOT PRINTED?"
    for (eachElementOfRDD <- rdd) {
      accumulator = accumulator ++ eachElementOfRDD
    }
    println(accumulator)
  }
}
Output:
21/05/13 09:10:52 INFO BlockManagerMasterEndpoint: Registering block manager host.docker.internal:64786 with 894.3 MB RAM, BlockManagerId(driver, host.docker.internal, 64786, None)
21/05/13 09:10:52 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, host.docker.internal, 64786, None)
21/05/13 09:10:52 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, host.docker.internal, 64786, None)
WHY IS THE AGGREGATED STRING NOT PRINTED?
I am a newbie to Spark and Scala, and I understand that the output of the above code is correct. What I want to know is the reason for this behaviour, what this concept is called, and some pointers to help me understand it.
EDIT: The variable name accumulator is a coincidence with Spark's accumulator functionality. I am only concerned with appending each string to the original string until the loop completes.
Upvotes: 0
Views: 132
Reputation: 14845
The loop body does not run on the driver: Spark serializes the closure and ships a copy of the local variable `accumulator` to each executor task, so each task mutates its own copy and the driver's variable is never updated. This is the closure-capture behaviour described in the Spark RDD programming guide under "Understanding closures". To aggregate values back to the driver, use a Spark accumulator. Two things should be changed in the code to get the expected output:
val accumulator = spark.sparkContext.collectionAccumulator[String]
rdd.foreach( eachElementOfRDD => accumulator.add(eachElementOfRDD))
println(accumulator.value)
Output:
[Project, Gutenberg’s, Alice’s, Adventures, in, Wonderland]
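Alternatively, for an RDD this small you can skip accumulators entirely and aggregate on the driver. A minimal sketch, assuming the same SparkSession setup as in the question (the object name `ReduceOnDriver` is just for illustration):

```scala
import org.apache.spark.sql.SparkSession

object ReduceOnDriver {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("reduceOnDriver")
      .getOrCreate()
    spark.sparkContext.setLogLevel("WARN")

    val data = Seq("Project", "Gutenberg’s", "Alice’s", "Adventures", "in", "Wonderland")
    val rdd = spark.sparkContext.parallelize(data)

    // collect() brings every element back to the driver, where a plain
    // Scala mkString builds the string; safe only for small RDDs.
    val joined = rdd.collect().mkString(" ")
    println(joined)

    // Distributed alternative: reduce combines elements on the executors
    // and returns only the final result to the driver. Note that element
    // order is not guaranteed across partitions.
    val reduced = rdd.reduce((a, b) => a + " " + b)
    println(reduced)

    spark.stop()
  }
}
```

Both approaches avoid mutating a driver-local `var` from inside a closure, which is exactly the pattern the original code relied on.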
Upvotes: 1