arohland

Reputation: 116

Understand Databricks Structured Streaming Spill to Disk Behavior

I am running a streaming pipeline on Databricks using PySpark (128 GB memory cluster with DBR 14.3, Spark 3.5.0). The stream processes zipped JSON files and merges them into a Delta table. The data for this pipeline has about 20 columns and does not contain any memory-heavy content.

We receive notifications of newly added files from an Azure queue. The files are processed using Auto Loader combined with a glob pattern. We run with trigger(processingTime="10 minutes") to regularly check for new files and merge them into our table. For debugging I have set the stream to process only a single file per batch.
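For reference, the read side of the stream looks roughly like this (paths, schema location and option values are simplified placeholders, not our real settings; spark is the built-in SparkSession on Databricks):

# Auto Loader in file notification mode, fed by the Azure queue,
# reading zipped JSON files that match a glob pattern.
raw_stream = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.useNotifications", "true")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema")
    .load("/mnt/landing/*/data/*.json.gz")
)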

We write the stream using foreachBatch to apply some transformations to each batch (adding columns, deduplicating, grouping by the index columns) and then merge the batch into a target Delta table. Matched records get two columns updated and not-matched records are inserted into the table.
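The write side, simplified (table and column names are placeholders for our real schema, not the actual pipeline code):

from pyspark.sql import functions as F
from delta.tables import DeltaTable

def process_batch(batch_df, batch_id):
    # Add columns, deduplicate and reduce to one row per index key.
    cleaned = (
        batch_df
        .withColumn("ingested_at", F.current_timestamp())
        .dropDuplicates(["index_col_1", "index_col_2"])
    )

    target = DeltaTable.forName(spark, "target_table")
    (
        target.alias("t")
        .merge(
            cleaned.alias("s"),
            "t.index_col_1 = s.index_col_1 AND t.index_col_2 = s.index_col_2",
        )
        .whenMatchedUpdate(set={"value": "s.value", "updated_at": "s.ingested_at"})
        .whenNotMatchedInsertAll()
        .execute()
    )

(
    raw_stream.writeStream                      # raw_stream from the snippet above
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/mnt/checkpoints/target_table")
    .trigger(processingTime="10 minutes")
    .start()
)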

The problem: I left the stream running for a few days and it processes the arriving files well. But each batch increases the consumed memory and also the used filesystem space. After a few days more than 500 GB of filesystem space was in use, which caused our streams to crash. I am trying to find out what causes Spark to require more and more memory and fill up the filesystem.

Looking at the Databricks Spark UI I found that several hundred GB of data were written to the IO cache. But that seems to be far more data than the pipeline actually processed, and it does not get released again.

Any recommendations on which settings to change to keep our pipeline from crashing?

Upvotes: 0

Views: 80

Answers (1)

JayashankarGS

Reputation: 8160

The disk I/O cache creates a copy of your data files in the nodes' local storage, which improves read speed.

But these cached files are only removed in LRU (least recently used) fashion, when the underlying file changes, or manually when the cluster is restarted.

To enable and disable the disk cache, run:

spark.conf.set("spark.databricks.io.cache.enabled", "[true | false]")

Disabling it is not recommended, as that impacts read performance. What you can do instead is limit the disk cache via its configuration settings:

spark.databricks.io.cache.maxDiskUsage 50g
spark.databricks.io.cache.maxMetaDataCache 1g
spark.databricks.io.cache.compression.enabled false

Refer to the Databricks documentation on disk caching for more details.

Additionally, check the following:

  1. Manually unpersist DataFrames after processing in foreachBatch:
def process_batch(df, batch_id):
    transformed_df = df.withColumn("new_col", ...)  # Your transformations
    transformed_df.persist()  # only relevant if the batch DataFrame is cached/reused
    transformed_df.write.format("delta").mode("append").saveAsTable("your_table")

    transformed_df.unpersist()  # release the cached blocks once the batch is written
  2. Don't cache unnecessary DataFrames.
  3. Optimize your target Delta table.
  4. Run VACUUM, which removes unwanted/unused files (see the sketch after this list).
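Both maintenance steps can be run from a notebook or a scheduled job; "your_table" below is a placeholder for the actual target table name:

# "your_table" is a placeholder for the target Delta table.
# OPTIMIZE compacts the many small files produced by frequent merges;
# VACUUM deletes data files no longer referenced by the Delta log once
# they are older than the retention period (7 days by default).
spark.sql("OPTIMIZE your_table")
spark.sql("VACUUM your_table")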

Upvotes: 1
