Reputation: 1265
I have a simple Spark job in which I read a file using SparkContext.textFile()
and then do some operations on that data; I am using spark-jobserver
to get the output.
In the code I cache the data, but after the job ends and I run the same Spark job
again, it does not reuse the file that is already in the cache. So the file is loaded every time, which takes more time.
Sample code:
val sc = new SparkContext("local", "test")
// cache() marks the RDD to be kept in memory once the first action computes it
val data = sc.textFile("path/to/file.txt").cache()
val lines = data.count()
println(lines)
Here, since I am reading the same file, I would expect the second execution to take the data from the cache, but it does not.
Is there any way to share the cached data among multiple Spark jobs?
Upvotes: 2
Views: 2070
Reputation: 25909
Yes - by calling persist/cache on the RDD you get and submitting the additional jobs on the same context. The cache lives only as long as the SparkContext, so if each run creates a fresh context (as new SparkContext("local", "test") inside the job does), the file has to be re-read every time.
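As a minimal sketch of how this can look with spark-jobserver, assuming the classic spark.jobserver SparkJob/NamedRddSupport API (the object name CachedCountJob and the RDD name "cached-lines" are illustrative assumptions, not from the question):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

// Illustrative job: reuses a cached RDD across job runs in the same context.
object CachedCountJob extends SparkJob with NamedRddSupport {

  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Reuse the RDD cached by an earlier job in the same context if it exists;
    // otherwise read the file once, cache it, and register it under a name.
    val data = namedRdds.get[String]("cached-lines").getOrElse {
      val rdd = sc.textFile("path/to/file.txt").cache()
      namedRdds.update("cached-lines", rdd)
      rdd
    }
    data.count()
  }
}

For this to help, the job must be submitted to a pre-created, long-running context (created once, e.g. via spark-jobserver's POST /contexts/&lt;name&gt; endpoint, and selected with context=&lt;name&gt; when submitting the job); if each submission spins up its own SparkContext, the cache is discarded when that context stops.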
Upvotes: 1