Marcin U

Reputation: 1

Databricks AutoLoader - how to handle Spark's transactional write (_SUCCESS file) on Azure Data Lake Storage?

The Databricks Spark write method (df.write.parquet) for Parquet files is transactional. After a successful write to Azure Data Lake Storage, a _SUCCESS file is created in the path where the Parquet files were written.

Example of the folder on ADLS including the _SUCCESS file: [image: ADLS folder listing showing the Parquet part files and the _SUCCESS marker]
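
For context, a minimal sketch of the kind of batch write that produces this marker; the DataFrame and the abfss path below are just placeholders:

# Placeholder example: a successful batch write creates the _SUCCESS file in the target folder
df = spark.range(10)  # any DataFrame
df.write.mode("overwrite").parquet("<abfss://container@account.dfs.core.windows.net/path/to/parq_files>")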

Is it possible to configure AutoLoader to load Parquet files only when the write finished successfully (a _SUCCESS file appeared in the folder)? In other words, if the folders listed by AutoLoader don't contain a _SUCCESS file, the Parquet files from those folders shouldn't be processed by AutoLoader.
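
For reference, a minimal sketch of the kind of AutoLoader read I mean; the paths and schema location are placeholders:

# Placeholder AutoLoader (cloudFiles) read of Parquet files from the landing folder
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "parquet")
      .option("cloudFiles.schemaLocation", "<path/to/schema_location>")
      .load("<path/to/landing_folder>"))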

I was looking for the right option in the documentation, but it seems like none of the available options can help me.

Upvotes: 0

Views: 305

Answers (1)

I agree with @JayashankarGS. The AutoLoader feature in Databricks allows you to automatically load data from a path into a Delta table as new files are added to that path. However, there is no built-in option in AutoLoader to conditionally load only Parquet files that have a corresponding _SUCCESS file in the folder.

If you want to ensure that only Parquet files from a successful write (one accompanied by a _SUCCESS file) are loaded, wrap the AutoLoader logic in a conditional check: if the _SUCCESS file is found, load the Parquet files; if it is not found, indicating an incomplete write, skip the loading step.

You can try the following:

parquet_path = "<Path/to/parq_files>"
fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())  # Hadoop FileSystem bound to the cluster's configuration
success_file_exists = fs.exists(spark._jvm.org.apache.hadoop.fs.Path(parquet_path + "/_SUCCESS"))  # True only if the _SUCCESS marker is present

Reference: apache spark - check if file exists
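
For illustration, a minimal sketch of how that check could gate the AutoLoader load itself; the schema location, checkpoint location, and target table name are placeholders, and trigger(availableNow=True) is just one way to run the stream as a batch-style job:

# Start the AutoLoader stream only if the _SUCCESS marker was found above
if success_file_exists:
    (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "parquet")
        .option("cloudFiles.schemaLocation", "<Path/to/schema_location>")
        .load(parquet_path)
        .writeStream
        .option("checkpointLocation", "<Path/to/checkpoint>")
        .trigger(availableNow=True)  # process the files currently in the folder, then stop
        .toTable("<target_delta_table>"))
else:
    print(f"No _SUCCESS file in {parquet_path} - skipping this load")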

Upvotes: 0
