Reputation: 6475
I'm new to Spark and have a question regarding RDDs. Let's say I define an RDD as follows:
val data1 = sc.textFile("path/to/file")
and then let's say I do the following:
1) val data2 = data1.map{...}
2) val data3 = data1
I'm curious to know what happens behind the scenes in 1) and 2). Are data1, data2, and data3 totally different in memory? That is, does each of them take up its own piece of memory, or is there some level of data sharing? For example, is there just one piece of memory for data1 and data3?
Upvotes: 0
Views: 112
Reputation: 67075
RDDs are merely representations of the work to be done, known as the lineage. That lineage is "immutable"*, which leads me to the first scenario. data1 is an instruction to load a file. When you call its map method, that instruction is composed with the new one, returning a new instruction set: load the file, then transform it. So data2 is a new instruction set that contains the first. In the second scenario, you end up with two references that point to the same instruction set.
So, really, all of those scenarios build on the same initial instruction set. You can see that in the following code:
val init = sc.parallelize(1 to 10).map(x => { println(x); x })
val mapped = init.map(_ + 1)
val initCopy = init
initCopy.cache()
initCopy.collect() // Notice that the println occurs... this also caches the result
mapped.collect()   // The println does NOT occur, since mapped reuses init's cached result
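As a small follow-up sketch (eq is standard Scala reference equality and toDebugString is a standard RDD method, though the exact debug output varies by Spark version), you can confirm the sharing directly:
initCopy eq init              // true: the same object, hence the same instruction set
mapped eq init                // false: mapped is a new RDD wrapping init
println(mapped.toDebugString) // prints mapped's lineage, with init's instructions as its parent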
*I say this in quotes because pieces of an RDD can be modified, like when you call cache, but the lineage itself is immutable.
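To illustrate that last point (a minimal sketch assuming a live SparkContext named sc; the printed StorageLevel format differs across Spark versions), cache only flips the RDD's storage level in place; the object and its lineage stay the same:
val rdd = sc.parallelize(1 to 5)
rdd.getStorageLevel    // StorageLevel.NONE: nothing is cached yet
val same = rdd.cache() // mutates the storage level and returns the same RDD
same eq rdd            // true: cache did not create a new RDD
rdd.getStorageLevel    // now StorageLevel.MEMORY_ONLY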
Upvotes: 1