Little Bobby Tables

Reputation: 5351

Force loading RDD from file to memory in Spark

I have a demo application that runs a Spark computation. It loads an RDD stored in an object file and then performs some tasks that depend on the user's input.

Loading the RDD using sparkContext.objectFile() is a lengthy operation. Since time is an issue, I would like to load it before the demo starts, and only perform the input-dependent calculations during the presentation. However, Spark's lazy evaluation means the file is only read once the entire computation is triggered.
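For illustration, the load looks roughly like this (the element type and path are placeholders, not my actual code):

val rdd = sc.objectFile[String]("/data/demo-rdd")
// no I/O happens here - objectFile() is lazy, so the file is only
// read when an action is eventually triggered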

RDD.cache() does not do the trick on its own, since caching is a lazy operation too.

Is there a way to force-load an RDD from file?

If not, is there a way to speed up RDD load, and/or keep it in memory for future Spark jobs?

Spark version is 1.5, running in single-node standalone mode. The file is read from the local file system. I can tweak Spark's configuration or change this setup if needed.

Upvotes: 2

Views: 1604

Answers (1)

Tzach Zohar

Reputation: 37832

After calling cache(), call any action on your RDD (count() is typically used) to "materialize" the cache. Further operations on this RDD will use the cached version:

rdd.cache()  // mark the RDD for caching - still lazy, nothing happens yet
rdd.count()  // action: forces the load and materializes the cache
// from here on, operations on rdd use the in-memory data
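
For example, here is a complete sketch in the question's setup; the path, element type, and user input are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)

// lazy: declares the RDD but reads nothing from disk yet
val rdd = sc.objectFile[String]("/data/demo-rdd")
rdd.cache()  // also lazy: only marks the RDD for in-memory storage
rdd.count()  // action: triggers the read and materializes the cache

// later, during the demo, input-dependent work hits the cached data
val userInput = "query"  // stand-in for the actual user input
val matches = rdd.filter(_.contains(userInput)).count()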

Upvotes: 1
