Reputation: 518
I am new to Scala programming and am currently working with RDDs. I am trying to pass each element of an RDD to a function and store the returned values in a new RDD. I am using map for this, but map invokes the function twice even though the RDD contains only one entry. It works fine when I use collect().foreach() instead of map, but then I cannot save the updated values in a new RDD because foreach returns Unit.
This code returns the value from the update function but invokes the function twice:
temp_rdd = my_rdd.map{x => update(x)}
Whereas this one invokes it exactly once, but I cannot keep the updated values:
my_rdd.collect().foreach{x => update(x)}
The foreach function returns Unit, so I cannot save its result in a new RDD. I am looking for a way to store the updated values in a new RDD.
Upvotes: 1
Views: 967
Reputation: 905
From https://spark.apache.org/docs/2.2.0/rdd-programming-guide.html
map is a transformation that passes each dataset element through a function and returns a new RDD representing the results. All transformations in Spark are lazy, and are only computed when an action requires a result to be returned to the driver program. By default, each transformed RDD may be recomputed each time you run an action on it (or you can persist the RDD in memory using .cache()).
On the other hand, actions (e.g., collect or reduce) return a value (not an RDD) to the driver program after running a computation on the RDD.
Here is an example of caching an RDD to prevent it from being computed multiple times:
val array = Array("1", "2", "3")
val rdd = sc.parallelize(array)
var i = 0
val mapRdd = rdd.map(s"$i: " + _)
mapRdd.take(3).foreach(println) // mapRdd is computed here...
// Output
// 0: 1
// 0: 2
// 0: 3
i = i + 1
mapRdd.take(3).foreach(println) // ... and here
// Output
// 1: 1
// 1: 2
// 1: 3
val cachedMapRdd = rdd.map(s"$i: " + _).cache()
cachedMapRdd.take(3).foreach(println) // cachedMapRdd is computed here...
// Output
// 1: 1
// 1: 2
// 1: 3
i = i + 1
cachedMapRdd.take(3).foreach(println) // ... but not here
// Output
// 1: 1
// 1: 2
// 1: 3
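Applied to the question, a likely explanation is that two actions are run on temp_rdd, so the uncached map is evaluated once per action. The following is only a sketch under that assumption: update here is a hypothetical stand-in that doubles an Int, the object and method names are made up for illustration, and a LongAccumulator is used just to count how often update runs in a local Spark context.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object MapOnceSketch {
  // Runs two actions on a mapped RDD and returns how many times `update` ran.
  def countUpdateCalls(useCache: Boolean): Long = {
    val conf = new SparkConf().setAppName("map-once-sketch").setMaster("local[1]")
    val sc = new SparkContext(conf)
    try {
      val calls = sc.longAccumulator("update calls")

      // Hypothetical stand-in for the question's update function.
      val update: Int => Int = x => { calls.add(1); x * 2 }

      // A single-element RDD in a single partition, as in the question.
      val myRdd = sc.parallelize(Seq(21), numSlices = 1)
      val mapped = myRdd.map(update)
      val tempRdd = if (useCache) mapped.cache() else mapped

      tempRdd.count()   // first action: evaluates the map
      tempRdd.collect() // second action: re-evaluates the map unless cached

      calls.sum
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    println(s"update calls without cache: ${countUpdateCalls(useCache = false)}")
    println(s"update calls with cache:    ${countUpdateCalls(useCache = true)}")
  }
}
```

In a local run where the cached partition fits in memory, the uncached variant should report two calls and the cached one a single call. Caching is best effort, so a partition that gets evicted can still be recomputed; if update has side effects that must happen exactly once, it is safer not to rely on cache() alone.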
Upvotes: 2