Reputation: 149
In Spark Structured Streaming (version 2.2.0), when using a mapGroupsWithState query with Update as the output mode, it seems that Spark stores the in-memory state data in a java.util.ConcurrentHashMap data structure. Can someone explain to me in detail what happens when the state data grows and there isn't enough memory anymore? Also, is it possible to change the limit for storing state data in memory using a Spark config parameter?
Upvotes: 4
Views: 2548
Reputation: 323
The existing State Store implementation uses in-memory HashMaps (for storage) plus HDFS (for fault tolerance). The HashMaps are versioned, one per micro-batch: there is a separate key-value map for each version of every aggregated partition in the executor memory of the worker (the number of versions to maintain is configurable). To answer your questions:
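As a rough illustration of that layout (a minimal sketch only; these class and method names are mine, not Spark's internal code), each micro-batch version is essentially a snapshot copy of a per-partition key-value map:

```scala
import java.util.concurrent.ConcurrentHashMap

// Hypothetical sketch of the versioned state layout, not Spark's actual classes.
// One key-value map exists per (version, partition); older versions are kept
// around for recovery and pruned after a configurable number of batches.
object VersionedStoreSketch {
  type StateMap = ConcurrentHashMap[String, Array[Byte]] // key -> encoded state row

  // micro-batch version -> state map for one aggregated partition
  private val versions = new ConcurrentHashMap[Long, StateMap]()

  def commit(batchId: Long, updates: Map[String, Array[Byte]]): Unit = {
    // Start from the previous version's map, then apply this batch's updates.
    val next = new StateMap(versions.getOrDefault(batchId - 1, new StateMap()))
    updates.foreach { case (k, v) => next.put(k, v) }
    versions.put(batchId, next)
  }
}
```

Every retained version holds a full map in executor memory, which is why the footprint grows with both the number of keys and the number of versions kept.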
Can someone explain to me in detail what happens when the state data grows and there isn't enough memory anymore?
The state store HashMaps share executor memory with the shuffle tasks. So as the state grows, or as shuffle tasks need more memory, frequent GCs and OOMs will occur, leading to executor failures.
is it possible to change the limit for storing state data in memory using a Spark config parameter?
Currently that is not possible. You can only specify the executor memory, which is shared by both the state store and the executor tasks; there is no way to divide the memory between them. This actually makes the current implementation unreliable in the case of sudden data bursts, and even watermarks will not help in those cases.
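In practice the only knobs are the total executor heap and how many state versions are retained. A minimal sketch, assuming you tune these at session startup (spark.sql.streaming.minBatchesToRetain is the version-retention setting, default 100):

```scala
import org.apache.spark.sql.SparkSession

// There is no state-store-specific memory cap; you can only size the heap
// that the state store shares with task execution, and reduce how many
// versions of the state maps are retained.
val spark = SparkSession.builder()
  .appName("stateful-query")
  .config("spark.executor.memory", "8g")                  // heap shared by tasks + state store
  .config("spark.sql.streaming.minBatchesToRetain", "10") // retain fewer state versions
  .getOrCreate()
```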
In case you are interested in how the state store works internally in Structured Streaming, this article might be useful: https://www.linkedin.com/pulse/state-management-spark-structured-streaming-chandan-prakash/
p.s. I am the author
Upvotes: 3
Reputation: 149518
Can someone explain to me in detail what happens when the state data grows and there isn't enough memory anymore?
The executor will crash with an OOM exception. Since with mapGroupsWithState you're the one in charge of adding and removing state, if you're overwhelming the JVM with data it can't allocate memory for, the process will crash.
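For example, a minimal sketch of keeping that state bounded with a processing-time timeout (the Event/KeyCount classes and the one-hour timeout are illustrative assumptions, not from the question):

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

case class Event(key: String)
case class KeyCount(key: String, count: Long)

// Count events per key, dropping a key's state once it has been idle
// for an hour so the state can't grow without bound.
def updateCount(key: String,
                events: Iterator[Event],
                state: GroupState[Long]): KeyCount = {
  if (state.hasTimedOut) {
    val last = state.get
    state.remove()                      // evict this key's state
    KeyCount(key, last)
  } else {
    val count = state.getOption.getOrElse(0L) + events.size
    state.update(count)
    state.setTimeoutDuration("1 hour")  // re-arm the idle timeout
    KeyCount(key, count)
  }
}

// Wiring it up (assuming `events: Dataset[Event]` is a streaming Dataset):
// events.groupByKey(_.key)
//       .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(updateCount)
```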
is it possible to change the limit for storing the state data in the memory, using a spark config parameter?
It isn't possible to limit the number of bytes you're storing in memory. Again, if this is mapGroupsWithState, you have to manage state in a way that won't cause your JVM to OOM, such as by setting timeouts and removing state. If we're talking about stateful aggregations where Spark manages the state for you, such as with the agg combinator, then you can limit the state using a watermark, which will evict old data from memory once the time frame passes (see the sketch below).
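A minimal sketch of such a watermarked aggregation (assuming a streaming DataFrame `events` with an `eventTime` timestamp column and a `key` column, which are my placeholders):

```scala
import org.apache.spark.sql.functions.{col, count, window}

// Spark manages the aggregation state; the watermark lets it evict state
// for windows that fall more than 10 minutes behind the max event time seen.
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window(col("eventTime"), "5 minutes"), col("key"))
  .agg(count("*").as("events"))
```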
Upvotes: 2