Spark structured streaming state management

I learned that Structured Streaming uses HDFSBackedStateStoreProvider by default. This means that all state-related information is stored at an HDFS location.

Does this ensure that no data is held in memory, where it could cause long GC pauses?

I ask because the job I am running stops processing data during high traffic volume and only catches up after a delay of 15-20 minutes.

Upvotes: 0

Answers (2)

Jungtaek Lim

Reputation: 1708

Does this ensure that no data is held in memory, where it could cause long GC pauses?

Spark keeps some versions of the state in the executors' memory to avoid re-reading the previous state for each batch.

By the way, which version of Spark are you using? Spark 2.4.0 includes improvements to the HDFS state store provider that significantly reduce memory usage in long-running Structured Streaming applications. If you are not on Spark 2.4.0 yet, it is worth checking out.

SPARK-24717 [1]: Split out min retain version of state for memory in HDFSBackedStateStoreProvider
SPARK-24763 [2]: Remove redundant key data from value in streaming aggregation
1. https://issues.apache.org/jira/browse/SPARK-24717
2. https://issues.apache.org/jira/browse/SPARK-24763
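
As a rough illustration of the first ticket: to my knowledge, SPARK-24717 added the setting spark.sql.streaming.maxBatchesToRetainInMemory (default 2), which is separate from spark.sql.streaming.minBatchesToRetain (default 100). The sketch below only assembles spark-submit arguments. The values are illustrative rather than recommendations, and to_submit_args is a hypothetical helper, not part of Spark.

```python
# Sketch only: settings that influence how much state the default
# HDFSBackedStateStoreProvider keeps on the executor heap (Spark 2.4.0+).
# The values are illustrative assumptions, not tuned recommendations.
state_store_conf = {
    # Number of recent state versions cached in executor memory (SPARK-24717).
    # Lower values use less heap but may require reloading state from files.
    "spark.sql.streaming.maxBatchesToRetainInMemory": "1",
    # Number of batches that must stay recoverable in the checkpoint.
    "spark.sql.streaming.minBatchesToRetain": "100",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit --conf arguments (hypothetical helper)."""
    args = []
    for key, value in sorted(conf.items()):
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # Prints the arguments to append to a spark-submit command line.
    print(" ".join(to_submit_args(state_store_conf)))
```

Lowering the in-memory count trades heap usage for possible extra reads from the checkpoint location, so it is worth measuring batch duration before and after changing it.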

Upvotes: 2

Akhil Bojedla

Reputation: 2228

You are right that Spark Structured Streaming supports HDFSBackedStateStoreProvider.

However, it does not ensure that no data is stored in memory. It uses HDFS to store checkpoints at regular intervals as write-ahead logs, so that if your stream goes down, the last known state can be restored from HDFS and the next run can resume processing from where the previous one left off.

Regarding long GC pauses, you might want to have a look at the following article:

https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
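
To check whether GC is really behind the stalls, one option is to enable GC logging on the executors through spark.executor.extraJavaOptions. The sketch below only builds the option string. The flags are Java 8 style and chosen as an assumption for illustration (newer JVMs use -Xlog:gc* for logging instead), so adjust them to your JVM version.

```python
# Sketch only: the JVM flags below are illustrative, Java 8 style options.
gc_flags = [
    "-XX:+UseG1GC",            # use the G1 collector
    "-XX:+PrintGCDetails",     # log details for each collection
    "-XX:+PrintGCTimeStamps",  # add timestamps to the GC log entries
]

# Value to pass to spark-submit as: --conf "<submit_arg>"
executor_java_options = " ".join(gc_flags)
submit_arg = f"spark.executor.extraJavaOptions={executor_java_options}"
print(submit_arg)
```

The resulting executor logs show pause durations, which can then be compared against the times when the job stops processing.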

Upvotes: 1
