Reputation: 30765
Let's say we have the following code.
x = sc.textFile(...)
y = x.map(...)
z = x.map(...)
Is it essential to cache x here? Would not caching x make Spark read the input file twice?
Upvotes: 4
Views: 1076
Reputation: 3212
It is not necessarily the case that this will make Spark read the input twice. Let's walk through the possible scenarios:
Example 1: Files not read even once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case Spark does nothing: transformations are lazy, and since no action has been called, the file is never read.
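You can observe this laziness directly: since no job runs until an action is called, even a nonexistent input path does not fail at this point. A minimal Scala sketch (the path is a hypothetical placeholder):
val x = sc.textFile("/no/such/file") // returns immediately; the file is not touched
val y = x.map(_.length)              // still lazy; nothing is read yet
// Only an action triggers a job; a missing file would only surface here:
// println(y.count())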
Example 2: Files read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
The file is read only once, to compute y through its map transformation; z is never computed because no action is called on it.
Example 3: Files read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Now the input file is read twice, because two actions are called on two different RDDs (y and z), and each one traces its lineage back to the file independently.
Example 4: Files read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(z.count()) #Action of RDD
Example 5: Files read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Since actions are called on two different RDDs, the file is read twice: each action recomputes the entire lineage from the source (z's lineage includes y, which includes x).
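One way to verify this recomputation yourself is a side-effect counter inside the transformation. A rough Scala sketch using a LongAccumulator (Spark 2.x API; the path is a placeholder, and keep in mind that accumulators in transformations can over-count if tasks are retried):
val acc = sc.longAccumulator("map-calls")
val x = sc.textFile("/path/to/input.txt")
val y = x.map { line => acc.add(1); line } // count how often this map actually runs
val z = y.map(_.length)
println(y.count())
println(z.count())
println(acc.value) // roughly twice the line count: y's map ran once per action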
Example 6: Files read once
x = sc.textFile(...) #creation of RDD
y = x.map(...).cache() #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Even though two different actions are used, the RDD is computed only once and stored in memory. The second action then operates on the cached RDD instead of re-reading the file.
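If you want to confirm that the cache is really being used, the RDD API offers a couple of hooks. A small Scala sketch (the path is a placeholder and the map functions are arbitrary examples):
val x = sc.textFile("/path/to/input.txt")
val y = x.map(_.toUpperCase).cache() // mark y for caching
val z = y.map(_.length)
println(y.count())          // first action: reads the file and materializes y into the cache
println(z.count())          // second action: served from the cached y
println(y.getStorageLevel)  // the storage level set by cache() (memory, deserialized)
println(z.toDebugString)    // the lineage shows a CachedPartitions entry for y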
Edit: Additional Information
So the question arises: what to cache and what not to cache?
Ans: The RDDs that you will be using again and again should be cached.
Example 7:
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case we are using x again and again, so it is advisable to cache x. Spark will then not have to read x from the source repeatedly, and if you are working with a huge amount of data this can save you a lot of time.
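Putting that advice into Example 7, a runnable Scala version might look like this (the path and the map functions are hypothetical placeholders):
val x = sc.textFile("/path/to/input.txt").cache() // x is reused, so cache it
val y = x.map(_.split(" ").length) // some transformation of x
val z = x.map(_.trim)              // another transformation of x
println(y.count()) // reads the file once and fills the cache
println(z.count()) // computed from the cached x; no second read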
Suppose you start caching every RDD in memory/disk, with or without serialization. If Spark runs short of memory while executing a task, it will start evicting old cached RDDs using an LRU (Least Recently Used) policy, and whenever an evicted RDD is needed again, Spark will re-run all the steps from the source up to that RDD's transformation.
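For reference, cache() is simply persist(StorageLevel.MEMORY_ONLY); the other storage levels let you trade memory, disk, and CPU against each other. A brief sketch of explicit persistence (the path is a placeholder):
import org.apache.spark.storage.StorageLevel

val x = sc.textFile("/path/to/input.txt")
// MEMORY_AND_DISK spills evicted partitions to disk instead of dropping them;
// the _SER variants store serialized bytes to save memory at some CPU cost.
x.persist(StorageLevel.MEMORY_AND_DISK_SER)
println(x.count()) // materializes and stores x
x.unpersist()      // explicitly release the storage when done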
Upvotes: 7