Reputation: 6475
I'm new to Spark and have a question regarding RDDs. Let's say I define an RDD as follows:
val data1 = sc.textFile("path/to/file")
and then let's say I do the following:
1) val data2 = data1.map{...}
2) val data3 = data1
I'm curious to know what happens behind the scenes in 1) and 2). Are data1, data2, and data3 totally different in memory? That is, does each of them take up its own piece of memory, or is there some level of data sharing? For example, is there just one piece of memory for data1 and data3?
Upvotes: 0
Views: 112
Reputation: 67075
RDDs are merely representations of the work to be done, known as the lineage. That lineage is "immutable"*, which leads me to the first scenario. data1 is an instruction to load a file. When you call its map method, that instruction is composed with the new one, returning a new instruction set: load the file, then transform it. So data2 is a new instruction set that contains the first. In the second scenario, you end up with two references that point to the same instruction set.
So, really, all of those scenarios build on the same initial instruction set. You can see that in the following code:
val init = sc.parallelize(1 to 10).map(x => { println(x); x })
val mapped = init.map(_ + 1)
val initCopy = init
initCopy.cache()
initCopy.collect() // Notice that the println occurs... this also caches the result
mapped.collect()   // The println does NOT occur, since mapped reuses init's cached result
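As a small follow-up sketch (eq is standard Scala reference equality and toDebugString is a standard RDD method, though the exact debug output varies by Spark version), you can confirm the sharing directly:
initCopy eq init              // true: the same object, hence the same instruction set
mapped eq init                // false: mapped is a new RDD wrapping init
println(mapped.toDebugString) // prints mapped's lineage, with init's instructions as its parent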
*I say this in quotes because pieces of an RDD can be modified, like when you call cache, but the lineage itself is immutable.
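To illustrate that last point (a minimal sketch assuming a live SparkContext named sc; the printed StorageLevel format differs across Spark versions), cache only flips the RDD's storage level in place; the object and its lineage stay the same:
val rdd = sc.parallelize(1 to 5)
rdd.getStorageLevel    // StorageLevel.NONE: nothing is cached yet
val same = rdd.cache() // mutates the storage level and returns the same RDD
same eq rdd            // true: cache did not create a new RDD
rdd.getStorageLevel    // now StorageLevel.MEMORY_ONLY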
Upvotes: 1