Reputation: 788
Spark caches the working dataset in memory and then performs computations at memory speed. Is there a way to control how long the working set resides in RAM?
I have a huge amount of data that is accessed across jobs. Loading it into RAM for the first job takes time, and when the next job arrives it has to load all the data into RAM again, which is time-consuming. Is there a way to cache the data indefinitely (or for a specified time) in RAM using Spark?
Upvotes: 7
Views: 6873
Reputation: 1103
To uncache explicitly, you can use RDD.unpersist()
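As a minimal sketch of explicit caching and uncaching (the app name and input path below are placeholders, not part of the original answer):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheAndUnpersist {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))

    // Placeholder path; point this at your own dataset.
    val data = sc.textFile("hdfs:///data/working-set")

    // Pin the RDD in executor memory; the first action materialises the cache.
    data.persist(StorageLevel.MEMORY_ONLY)
    println(data.count())   // reads from the source, then caches
    println(data.count())   // served from RAM

    // Release the memory explicitly once the RDD is no longer needed.
    data.unpersist()

    sc.stop()
  }
}
```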
If you want to share cached data across multiple jobs, you can keep a single long-running SparkContext that serves all of them (for example behind a job server), or stage the working set in an external in-memory store such as Tachyon so later jobs can reload it from RAM; a sketch of the latter follows.
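This is only a hedged sketch: the tachyon:// host, port, and paths are assumptions for illustration, and it requires Tachyon's Hadoop-compatible client on the classpath. Any shared filesystem path would work with the same calls, just without the in-memory guarantee.

```scala
// Job 1: materialise the working set into Tachyon's in-memory storage.
val data = sc.textFile("hdfs:///data/working-set")
data.saveAsObjectFile("tachyon://tachyon-master:19998/cache/working-set")

// Job 2 (a separate application with its own SparkContext, possibly much later):
val restored = sc.objectFile[String]("tachyon://tachyon-master:19998/cache/working-set")
println(restored.count())   // reads from Tachyon memory instead of the original source
```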
I have been experimenting with caching options in Spark. You can read more here: http://sujee.net/understanding-spark-caching/
Upvotes: 9
Reputation: 4372
You can specify a storage level when persisting an RDD. Note that RDD.cache() takes no arguments; it is shorthand for RDD.persist(StorageLevel.MEMORY_ONLY), and other levels are chosen with persist().
Spark also drops cached partitions automatically, in least-recently-used order, when executors run short of memory; otherwise the data stays cached until you call unpersist() or the application ends.
There is no built-in option to cache an RDD for a specified amount of time.
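A short sketch of the above, assuming an existing SparkContext named sc and a placeholder input path:

```scala
import org.apache.spark.storage.StorageLevel

// Equivalent ways to cache at the default level:
val lines = sc.textFile("hdfs:///data/working-set")   // placeholder path
lines.cache()                                          // same as persist(StorageLevel.MEMORY_ONLY)

// Or pick a different storage level explicitly; partitions that do not fit
// in memory spill to local disk instead of being recomputed.
val parsed = lines.map(_.split(","))
parsed.persist(StorageLevel.MEMORY_AND_DISK)
```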
For guidance on choosing a storage level, see the programming guide:
http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
Upvotes: 0