Reputation: 993
I want to maintain a long-lived cache (a HashMap) in Spark executor memory so that all tasks running on the executor (at different times) can do lookups there and also update the cache.
Is this possible in Spark streaming?
Upvotes: 3
Views: 1751
Reputation: 1578
I'm not sure there is a way to store custom data structures permanently on executors. My suggestion is to use an external caching system (such as Redis, Memcached, or even ZooKeeper in some cases). You can then connect to that system using methods like foreachPartition or mapPartitions while processing the RDD/DataFrame, which reduces the number of connections to one per partition.
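As a rough sketch of that one-connection-per-partition pattern, here is what a lookup through Redis could look like with the Jedis client. The endpoint, keys, and class names are illustrative assumptions, not from the original answer, and the example uses a plain RDD for brevity; in a streaming job the same pattern applies inside foreachRDD.

    import org.apache.spark.sql.SparkSession
    import redis.clients.jedis.Jedis

    object PartitionCacheLookup {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("partition-cache").getOrCreate()
        val sc = spark.sparkContext

        val keys = sc.parallelize(Seq("user:1", "user:2", "user:3"))

        // Open one connection per partition instead of one per record.
        val values = keys.mapPartitions { iter =>
          val jedis = new Jedis("localhost", 6379) // assumed Redis endpoint
          val results = iter.map { key =>
            // Tasks can read here, and could also update the shared cache via jedis.set(...)
            key -> Option(jedis.get(key))
          }.toList // materialize before closing the connection (iterators are lazy)
          jedis.close()
          results.iterator
        }

        values.collect().foreach(println)
        spark.stop()
      }
    }

Note the .toList before jedis.close(): mapPartitions returns a lazy iterator, so without materializing it the connection would be closed before any lookup runs.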
This works well because both Redis and Memcached are in-memory stores, so there is no overhead from spilling data to disk.
The two other ways to distribute state across executors are Accumulators and Broadcast variables. All executors can write into an Accumulator, but it can be read only by the driver. A Broadcast variable is written once on the driver and then distributed to the executors as a read-only data structure. Neither fits your case, so the external cache described above is the only viable approach that I can see here.
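A short sketch of those two limitations, using standard Spark APIs (the map contents and accumulator name are placeholders):

    import org.apache.spark.sql.SparkSession

    object SharedStateLimits {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("shared-state").getOrCreate()
        val sc = spark.sparkContext

        // Broadcast variable: written once on the driver, read-only on executors.
        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

        // Accumulator: executors may only add to it; only the driver can read it.
        val misses = sc.longAccumulator("cache-misses")

        val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
          // Executors cannot modify lookup here, only read it.
          lookup.value.getOrElse(key, { misses.add(1); -1 })
        }

        resolved.collect()
        println(s"misses = ${misses.value}") // reading the accumulator is only valid on the driver
        spark.stop()
      }
    }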
Upvotes: 3