Atom

Reputation: 788

Apache Spark in-memory caching

Spark caches the working dataset in memory and then performs computations at memory speed. Is there a way to control how long the working set resides in RAM?

I have a huge amount of data that is accessed across jobs. It takes time to load the data into RAM initially, and when the next job arrives it has to load all the data into RAM again, which is time-consuming. Is there a way to cache the data permanently (or for a specified time) in RAM using Spark?

Upvotes: 7

Views: 6873

Answers (2)

Sujee Maniyam

Reputation: 1103

To uncache explicitly, you can use RDD.unpersist()
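A minimal sketch of that cache/unpersist lifecycle, assuming a hypothetical input path (`/data/events` is a placeholder, not from the original post): the first action materializes the cache, and `unpersist()` releases it explicitly.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo"))

    // Placeholder path for illustration only
    val events = sc.textFile("/data/events")

    events.cache()                                        // mark the RDD for in-memory caching
    println(events.count())                               // first action materializes the cache
    println(events.filter(_.contains("ERROR")).count())   // served from memory

    events.unpersist()                                    // explicitly drop the cached blocks
    sc.stop()
  }
}
```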

If you want to share cached RDDs across multiple jobs you can try the following:

  1. Cache the RDD in a single context and re-use that context for other jobs. This way you cache only once and use it many times (see the sketch after this list).
  2. There are 'Spark job servers' built to provide exactly this functionality. Check out the Spark Job Server open-sourced by Ooyala.
  3. Use an external caching solution like Tachyon.
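A hedged sketch of option 1: keep one long-lived SparkContext, cache the dataset once, and let each subsequent job (action) submitted through that context reuse the in-memory copy. The input path and filter predicates below are placeholders, not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object SharedContextJobs {
  def main(args: Array[String]): Unit = {
    // One long-lived context owns the cached data; as long as this
    // process stays up, every job submitted through it reuses the cache.
    val sc = new SparkContext(new SparkConf().setAppName("shared-context"))

    // Hypothetical input path; cached once at MEMORY_ONLY
    val logs = sc.textFile("/data/logs").persist(StorageLevel.MEMORY_ONLY)
    logs.count()  // first action loads the data into the cache

    // "Job" 1: reads from the in-memory copy, not from the source
    val errors = logs.filter(_.contains("ERROR")).count()

    // "Job" 2: also served from the cache
    val warnings = logs.filter(_.contains("WARN")).count()

    println(s"errors=$errors warnings=$warnings")
    sc.stop()  // the cache is lost once the context is stopped
  }
}
```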

I have been experimenting with caching options in Spark. You can read more here: http://sujee.net/understanding-spark-caching/

Upvotes: 9

Vijay Innamuri

Reputation: 4372

You can specify the storage level when persisting an RDD, e.g. RDD.persist(StorageLevel.MEMORY_ONLY). RDD.cache() is shorthand for that memory-only level.

Spark automatically clears the cached data when no other action requires that RDD.

There is no option to cache an RDD for a specified time.
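For reference, a short sketch of specifying storage levels explicitly; the input path and transformations are illustrative placeholders only.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("storage-level-demo"))

    // Hypothetical input path, used only for illustration
    val data = sc.textFile("/data/input")

    // cache() is equivalent to persist(StorageLevel.MEMORY_ONLY)
    val inMemory = data.map(_.toUpperCase).persist(StorageLevel.MEMORY_ONLY)

    // Other levels spill to disk or serialize to save memory
    val memAndDisk = data.map(_.length).persist(StorageLevel.MEMORY_AND_DISK)

    println(inMemory.count())
    println(memAndDisk.count())

    sc.stop()
  }
}
```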

Please check out the link below:

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose

Upvotes: 0
