MetallicPriest

Reputation: 30765

Does an RDD need to be cached if used more than once?

Let's say we have the following code:

x = sc.textFile(...)
y = x.map(...)
z = x.map(...)

Is it essential to cache x here? If x is not cached, will Spark read the input file twice?

Upvotes: 4

Views: 1076

Answers (1)

Ajay Gupta

Reputation: 3212

Not caching x does not necessarily make Spark read the input twice.

Here are all the possible scenarios:

Example 1: File not read at all

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD

In this case Spark does nothing, because there is only a chain of transformations and no action; transformations are lazy.

Example 2: File read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD
print(y.count())    # action on RDD

The file is read only once: the single action on y triggers the read and the map that produces y, while z is never computed.

Example 3: File read twice

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD
print(y.count())    # action on RDD
print(z.count())    # action on RDD

Now the input file is read twice, because two separate actions are triggered and x is not cached, so each action recomputes its lineage from the source.

Example 4: File read once

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = y.map(...)    # transformation of RDD
print(z.count())    # action on RDD

The file is read only once: z depends on y, which depends on x, so the single action computes the whole chain in one pass.

Example 5: File read twice

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = y.map(...)    # transformation of RDD
print(y.count())    # action on RDD
print(z.count())    # action on RDD

Since actions are run on two different RDDs and nothing is cached, the file is read twice: each count recomputes the full lineage from the source.

Example 6: File read once (with cache)

x = sc.textFile(...)    # creation of RDD
y = x.map(...).cache()    # transformation of RDD, marked for caching
z = y.map(...)    # transformation of RDD
print(y.count())    # action on RDD
print(z.count())    # action on RDD

Even though two different actions are used, the RDD y is computed only once and stored in memory; the second action then works on the cached RDD instead of going back to the source.
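One way to check this in PySpark is a minimal sketch like the following (not from the original answer: the file path and the lambda are placeholders, while is_cached and getStorageLevel() are standard RDD members):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()                  # reuse the shell's SparkContext if one exists
x = sc.textFile("input.txt")                     # placeholder path
y = x.map(lambda line: line.upper()).cache()     # placeholder transformation, marked for caching

print(y.is_cached)           # True as soon as cache()/persist() has been called
print(y.count())             # first action materialises y and stores its partitions
print(y.getStorageLevel())   # prints a human-readable description of the storage level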

Edit: Additional Information

So the question arises: what should be cached and what should not?
Answer: an RDD that you will use again and again needs to be cached.
Example 7:

x = sc.textFile(...)    # creation of RDD
y = x.map(...)    # transformation of RDD
z = x.map(...)    # transformation of RDD

In this case we are using x again and again, so it is advised to cache x: Spark will not have to recompute x from the source for every action. If you are working on a huge amount of data, this will save you a lot of time. A sketch with the cache applied follows below.
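Here is a minimal sketch of Example 7 with the cache applied (the path and the lambdas are placeholders; the point is that the file is read from the source only once and both actions reuse the cached x):

from pyspark import SparkContext

sc = SparkContext.getOrCreate()             # reuse the shell's SparkContext if one exists
x = sc.textFile("input.txt").cache()        # placeholder path; x is reused, so cache it
y = x.map(lambda line: line.upper())        # placeholder transformation
z = x.map(lambda line: len(line))           # placeholder transformation

print(y.count())    # first action: reads the file once and caches x's partitions
print(z.count())    # second action: served from the cached x, no second read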

Suppose you start caching every RDD in memory/disk, with or without serialization. If Spark then runs short of memory while doing some task, it starts evicting old RDDs using an LRU (Least Recently Used) policy, and whenever an evicted RDD is needed again, Spark redoes all the steps from the source up to that RDD's transformation.
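To make that concrete, here is a small sketch of choosing a storage level explicitly and releasing it when done (the path is a placeholder; persist, StorageLevel and unpersist are the standard PySpark API). MEMORY_AND_DISK spills partitions to local disk instead of dropping them, so they do not have to be recomputed from the source:

from pyspark import SparkContext, StorageLevel

sc = SparkContext.getOrCreate()            # reuse the shell's SparkContext if one exists
x = sc.textFile("input.txt")               # placeholder path

# Spill partitions to local disk when memory is tight, rather than
# evicting them and recomputing from the source later.
x.persist(StorageLevel.MEMORY_AND_DISK)

print(x.count())    # materialises x and stores it
print(x.count())    # served from memory (or local disk); the file is not re-read

x.unpersist()       # free the storage once x is no longer needed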

Upvotes: 7
