Reputation: 993
I have a case where I want to download some data from a remote store every hour and store it as key-value pairs in an RDD on an executor/worker. I want to cache this RDD so that all future jobs/tasks/batches running on that executor/worker can use the cached RDD to do lookups. Is this possible in Spark Streaming?
Some relevant code or pointers to relevant code will be helpful.
Upvotes: 2
Views: 414
Reputation: 8995
If you just need a giant, distributed map, and you want to use Spark, write a standalone job that downloads the data every hour and caches the RDD thus obtained (you can unpersist the old RDD). Let us call this job DataRefresher.
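Here is a minimal sketch of that idea; fetchFromRemoteStore() is a hypothetical stand-in for whatever downloads your key-value data:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object DataRefresher {
  @volatile private var current: RDD[(String, String)] = _

  // Hypothetical: replace with your actual remote-store download logic.
  def fetchFromRemoteStore(): Seq[(String, String)] = ???

  def refresh(spark: SparkSession): Unit = {
    val fresh = spark.sparkContext.parallelize(fetchFromRemoteStore()).cache()
    fresh.count()                        // materialize before swapping in
    val old = current
    current = fresh
    if (old != null) old.unpersist()     // drop the previous snapshot
  }

  // PairRDDFunctions.lookup returns all values for a key.
  def lookup(key: String): Seq[String] = current.lookup(key)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("DataRefresher").getOrCreate()
    while (true) {
      refresh(spark)
      Thread.sleep(60 * 60 * 1000L)      // refresh once an hour
    }
  }
}
```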
You can then expose a REST API (if you are on Scala, consider using Scalatra) that wraps the DataRefresher and returns the value for a given key, something like http://localhost:9191/lookup/key, which other jobs can use to do a relatively fast lookup.
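A hypothetical Scalatra route wrapping that lookup might look like this (the servlet name and response format are assumptions):

```scala
import org.scalatra.ScalatraServlet

// Exposes DataRefresher over HTTP, e.g. GET /lookup/somekey
class LookupServlet extends ScalatraServlet {
  get("/lookup/:key") {
    DataRefresher.lookup(params("key")) match {
      case Seq()  => halt(404, s"no value for key ${params("key")}")
      case values => values.mkString("\n")
    }
  }
}
```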
Upvotes: 0
Reputation: 231
Alluxio is a memory-centric distributed storage system. It can be used to cache Spark RDDs in memory so that multiple, or future, Spark applications and jobs can access them.
Spark can store RDDs in Alluxio memory, and later Spark jobs can read them back from there; the Alluxio documentation covers how to set up and configure Alluxio with Spark.
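For illustration, here is a sketch of that write/read pattern, assuming an Alluxio master at alluxio-master:19998 and a made-up path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("AlluxioExample").getOrCreate()

// One application writes the key-value snapshot into Alluxio...
val kv = spark.sparkContext.parallelize(Seq("a,1", "b,2"))
kv.saveAsTextFile("alluxio://alluxio-master:19998/cache/kv-snapshot")

// ...and a later, separate application reads it back from Alluxio memory.
val cached = spark.sparkContext.textFile("alluxio://alluxio-master:19998/cache/kv-snapshot")
```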
Upvotes: 3
Reputation: 750
Given your requirements, here is what I would propose:
Note: Your notion of "caching within an executor to use across applications" is not correct. Executors belong to a single Spark application, and so does any RDD within that application.
If you really need to invest in caching data on distributed nodes, you may want to consider an off-heap in-memory storage system such as Tachyon (now renamed Alluxio).
Upvotes: 0