Reputation: 1031
I have a dataset that is read by multiple programs. Instead of each program reading this dataset into memory several times a day, is there a way for Spark to cache the dataset so that any program can use it?
Upvotes: 4
Views: 563
Reputation: 51
I think you should try checkpoint(). An RDD that calls checkpoint() is saved to HDFS or to a local file, which can persist across multiple applications.
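A minimal sketch of what I mean, assuming a plain SparkContext. The checkpoint directory, the app name, and the sample data are made up for illustration, and `local[*]` is only there so it runs on a single machine:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CheckpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Hypothetical directory; on a cluster this should be an HDFS (or other shared) path.
    sc.setCheckpointDir("/tmp/spark-checkpoints")

    // Placeholder data standing in for the real dataset.
    val rdd = sc.parallelize(1 to 100).map(_ * 2)

    rdd.checkpoint() // only marks the RDD for checkpointing; nothing is written yet
    rdd.count()      // the first action actually writes the checkpoint files

    // Prints Some(path) once the checkpoint has been written.
    println(rdd.getCheckpointFile)

    sc.stop()
  }
}
```

The files stay in the checkpoint directory after the application exits, but I'm not aware of a public API for loading them from a different application, so check that before relying on this.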
Upvotes: 0
Reputation: 16096
RDDs and Datasets cannot be shared between applications (at least, there is no official API for sharing memory).
However, you may be interested in a data grid. Take a look at Apache Ignite. For example, you can load data into Spark, preprocess it, and save it to the grid. Other applications can then simply read the data from the Ignite cache.
There is a special type of RDD, named IgniteRDD, which allows you to use an Ignite cache just like any other data source. Of course, like any other RDD, it can be converted to a Dataset.
It would be something like this:
// toDF needs `import spark.implicits._` (or `sqlContext.implicits._`) in scope
val rdd = igniteContext.fromCache("igniteCache")
val dataFrame = rdd.toDF
You can find more information about IgniteContext and IgniteRDD here
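For the writing side (the application that preprocesses the data and saves it to the grid), a rough sketch could look like the following. It assumes the Ignite 2.x ignite-spark module, a default IgniteConfiguration, and made-up key-value data; only the cache name "igniteCache" comes from the snippet above, so adjust the rest to your setup:

```scala
import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spark.IgniteContext
import org.apache.spark.{SparkConf, SparkContext}

object IgniteWriterSketch {
  def main(args: Array[String]): Unit = {
    // Master URL is expected to come from spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("ignite-writer-sketch"))

    // Assumption: Ignite 2.x API with a default node configuration.
    val igniteContext = new IgniteContext(sc, () => new IgniteConfiguration())

    // Hypothetical preprocessing step that produces key-value pairs.
    val preprocessed = sc.parallelize(1 to 1000).map(i => (i, s"row-$i"))

    // Same cache name that the reading application passes to fromCache.
    val cacheRdd = igniteContext.fromCache[Int, String]("igniteCache")

    // Writes the pairs into the Ignite cache.
    cacheRdd.savePairs(preprocessed)

    sc.stop()
  }
}
```

The reading applications then only need their own IgniteContext pointing at the same cluster and the same cache name.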
Upvotes: 1