Reputation: 3391
I'm trying to recreate the flatMap function using foreach and List.concat, but the resulting list seems unchanged.
Here's the reference:
val rdd: List[String] = List("Hello sentence one",
"This is the next sentence",
"The last sentence")
val fm: List[String] = rdd.flatMap(s => s.split("\\W"))
println(fm)
which gives:
List(Hello, sentence, one, This, is, the, next, sentence, The, last, sentence)
And here is my approach to recreate the same:
val nonRdd: List[String] = List("Hello sentence one",
"This is the next sentence",
"The last sentence")
var nonfm: List[String] = List()
nonRdd.foreach(line => List.concat(nonfm, line.split("\\W")))
println("nonfm: " + nonfm)
So every line is split into words, and the resulting intermediate list is supposed to be concatenated to the previously initialized list nonfm.
However, nonfm is empty:
nonfm: List()
Upvotes: 0
Views: 510
Reputation: 2734
As I mentioned in the comments, List in Scala defaults to scala.collection.immutable.List.
As the documentation says, concat returns a new list rather than mutating the original one (it couldn't anyway, since the list is immutable):
Returns a new sequence containing the elements from the left hand operand followed by the elements from the right hand operand.
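A quick way to see this for yourself is the minimal sketch below. The object and value names (ConcatDemo, original, combined) are made up for illustration and are not from the question.

```scala
object ConcatDemo {
  // Illustrative names only; not taken from the question.
  val original: List[String] = List("a", "b")

  // List.concat builds and returns a new list; `original` is left untouched.
  val combined: List[String] = List.concat(original, List("c"))

  def main(args: Array[String]): Unit = {
    println(original) // List(a, b)
    println(combined) // List(a, b, c)
  }
}
```

If the return value of List.concat is not stored anywhere, as in the foreach from the question, it is simply discarded.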
So you need to update the variable on every iteration with a simple assignment:
val nonRdd: List[String] = List("Hello sentence one",
"This is the next sentence",
"The last sentence")
var nonfm: List[String] = List()
nonRdd.foreach(line => nonfm = List.concat(nonfm, line.split("\\W")))
println("nonfm: " + nonfm)
Based on the use of the word RDD, I am guessing you will be using Spark eventually. I hope you are simply experimenting to understand how things work, but please do not use mutable variables (var) in Spark, or in Scala in general. See @Avishek's answer for why they will break your program in Spark.
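If you want to keep the "build it up step by step" idea without a var, one possible sketch is to thread the accumulator through foldLeft. The names FoldSketch and splitWords are hypothetical helpers I made up here, and this is only one way to do it on a plain List, not how flatMap itself is implemented.

```scala
object FoldSketch {
  // Hypothetical helper: foldLeft passes the accumulated list from one
  // step to the next, so no variable is ever reassigned.
  def splitWords(lines: List[String]): List[String] =
    lines.foldLeft(List.empty[String]) { (acc, line) =>
      acc ++ line.split("\\W").toList
    }

  def main(args: Array[String]): Unit = {
    val nonRdd: List[String] = List("Hello sentence one",
      "This is the next sentence",
      "The last sentence")
    println("nonfm: " + splitWords(nonRdd))
  }
}
```

Each step returns a new list that becomes acc for the next line, which is the same "assign the result back" idea as above, just without mutable state.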
Upvotes: 4
Reputation: 6994
This is the expected behaviour in Spark. The variable var nonfm: List[String] = List() is defined on the master. When you run nonRdd.foreach(line => List.concat(nonfm, line.split("\\W"))), each partition of nonRdd gets its own copy of nonfm.
When the foreach runs, the master sends the closure (that is, the code and the variables it references) to each RDD partition by serializing it. These partitions might be running on completely different machines.
Finally, when you call println("nonfm: " + nonfm), it prints the nonfm declared on the master. This copy of the variable has not been mutated at all, so the result is empty.
Upvotes: 1