Reputation: 13585
I learned that Structured Streaming uses HDFSBackedStateStoreProvider by default.
This means that all state-related information is stored at an HDFS location.
Does it ensure that no data is stored in memory, which could cause long GC pauses?
I ask because the job I am running stops processing data during high traffic volume and only catches up after a delay of 15-20 minutes.
Upvotes: 0
Views: 847
Reputation: 1708
Does it ensure that no data is stored in memory, which could cause long GC pauses?
No. Spark keeps some versions of the state in executor memory so that it does not have to re-read the previous state for every batch.
By the way, which version of Spark are you using? Spark 2.4.0 includes some improvements to the HDFS state store provider that heavily reduce memory usage in long-running Structured Streaming applications. If you're not on Spark 2.4.0 yet, it's worth checking out.
SPARK-24763: Remove redundant key data from value in streaming aggregation
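As a rough illustration of where these knobs live, here is a minimal sketch of a session configured with the default provider spelled out explicitly. The application name and the chosen values are placeholders of mine, not recommendations. `spark.sql.streaming.maxBatchesToRetainInMemory` is, as far as I know, only available from Spark 2.4.0 onward, so treat that line as an assumption on older versions.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the app name and the values below are illustrative placeholders.
val spark = SparkSession.builder()
  .appName("state-store-config-sketch")
  // The default provider, written out explicitly for clarity.
  .config("spark.sql.streaming.stateStore.providerClass",
    "org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider")
  // Assumed to exist only in Spark 2.4.0+: how many state versions
  // each executor keeps cached in memory.
  .config("spark.sql.streaming.maxBatchesToRetainInMemory", "1")
  // How many batches must stay recoverable from the checkpoint files.
  .config("spark.sql.streaming.minBatchesToRetain", "100")
  .getOrCreate()
```

Lowering the number of versions kept in memory trades some re-loading from HDFS for a smaller heap footprint, so it is worth measuring both GC time and batch duration before and after changing it.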
Upvotes: 2
Reputation: 2228
You are right that Spark Structured Streaming supports HDFSBackedStateStoreProvider.
However, that doesn't ensure that no data is stored in memory. HDFS is used to store checkpoints at regular intervals, as write-ahead logs. This way, if your stream goes down, the last known state can be restored from HDFS, and the next run can resume processing from where the previous one left off.
Regarding long GC pauses, you might want to have a look at the following article:
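To make the checkpointing part concrete, here is a minimal sketch of a stateful query that writes its checkpoint to HDFS. The built-in `rate` source, the console sink, the bucketing expression, and the checkpoint path are all placeholders I picked for illustration; substitute your own source, sink, and location.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("checkpoint-sketch") // hypothetical app name
  .getOrCreate()

// A stateful aggregation: the running counts are the state that Spark
// keeps in executor memory and also persists under the checkpoint location.
val counts = spark.readStream
  .format("rate") // built-in test source that emits (timestamp, value) rows
  .load()
  .groupBy((col("value") % 10).as("bucket"))
  .count()

val query = counts.writeStream
  .outputMode("update")
  .format("console")
  // Hypothetical path; state and offsets are written here for recovery.
  .option("checkpointLocation", "hdfs:///tmp/checkpoints/checkpoint-sketch")
  .start()

query.awaitTermination()
```

Restarting the same query with the same `checkpointLocation` is what lets it pick up from the last committed batch. The checkpoint is there for recovery, not as a replacement for the in-memory state.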
Upvotes: 1