Gourav

Reputation: 1265

How to cache data in Apache Spark so it can be reused by other Spark jobs

I have a simple Spark program in which I read a file using SparkContext.textFile() and then perform some operations on that data, and I am using spark-jobserver to get the output. In the code I cache the data, but after the job ends and I execute the same Spark job again, it does not pick up the file that is already in the cache. So the file is loaded every time, which takes more time.

Sample code:

import org.apache.spark.SparkContext

val sc = new SparkContext("local", "test")
// cache() marks the RDD for in-memory storage within this SparkContext
val data = sc.textFile("path/to/file.txt").cache()
val lines = data.count()
println(lines)

Here, since I am reading the same file, the second execution should take the data from the cache, but it does not.

Is there any way I can share the cached data among multiple Spark jobs?

Upvotes: 2

Views: 2070

Answers (1)

Arnon Rotem-Gal-Oz

Reputation: 25909

Yes: call persist()/cache() on the RDD you get, and submit the additional jobs on the same context. An RDD's cached blocks live inside the SparkContext (and its executors) that created them, so a fresh context started for each run cannot see them; with spark-jobserver that means running your jobs in a shared, long-lived context instead of creating a new one per job.
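
A minimal sketch of what that looks like, reusing the file path from your question: both count() calls below are separate Spark jobs, but because they run on the same SparkContext against the same cached RDD, only the first one actually reads the file from disk.

import org.apache.spark.SparkContext

object SharedCacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "shared-cache-demo")

    // Job 1: mark the RDD for caching; count() materializes it,
    // reading the file from disk and filling the in-memory cache.
    val data = sc.textFile("path/to/file.txt").cache()
    println(data.count())

    // Job 2, on the SAME SparkContext: computed from the cached
    // blocks, so the file is not read again.
    println(data.filter(_.nonEmpty).count())

    sc.stop()
  }
}

With spark-jobserver this means configuring a long-lived shared context and submitting both jobs to it; its NamedRddSupport trait additionally lets one job publish an RDD under a name so that a later job on the same context can look it up.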

Upvotes: 1
