Reputation: 1
The Databricks Spark write method for parquet files (df.write.parquet) is transactional: after a successful write to Azure Data Lake Storage, a _SUCCESS file is created in the path where the parquet files were written. Example of such a folder on ADLS, including the _SUCCESS file:
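A typical layout (the path and part-file names here are illustrative, not from the original post) looks like:

abfss://container@storageaccount.dfs.core.windows.net/output/
    _SUCCESS
    part-00000-<uuid>-c000.snappy.parquet
    part-00001-<uuid>-c000.snappy.parquet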
Is it possible to configure AutoLoader to load parquet files only when the write completed successfully (a _SUCCESS file appeared in the folder)? In other words, if a folder listed by AutoLoader doesn't contain a _SUCCESS file, the parquet files in that folder shouldn't be processed by AutoLoader.
I looked for the right option in the documentation, but it seems none of the available options can help me.
Upvotes: 0
Views: 305
Reputation: 3215
I agree with @JayashankarGS. The AutoLoader feature in Databricks lets you automatically load data from a path into a Delta table as new files are added to that path. However, AutoLoader has no built-in option to load only parquet files that have a corresponding _SUCCESS file in their folder.
If you want to ensure that only parquet files from a successful write (one that produced a _SUCCESS file) are loaded, wrap the AutoLoader logic in a conditional check: if the _SUCCESS file is found, load the parquet files; if it is not found, indicating a failed or incomplete write, skip the loading step.
You can try the following:
parquet_path = "<Path/to/parq_files>"

# Check for the _SUCCESS marker file via the Hadoop FileSystem API (JVM gateway)
hadoop_fs = spark._jvm.org.apache.hadoop.fs.FileSystem.get(spark._jsc.hadoopConfiguration())
success_file_exists = hadoop_fs.exists(spark._jvm.org.apache.hadoop.fs.Path(parquet_path + "/_SUCCESS"))
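Building on that flag, a minimal sketch of the conditional load could look like the following; the schema location, checkpoint location, and target table name are placeholder assumptions, not from the original question:

# Hypothetical sketch: start the AutoLoader stream only when the marker file exists
if success_file_exists:
    df = (spark.readStream
          .format("cloudFiles")                                    # AutoLoader source
          .option("cloudFiles.format", "parquet")                  # input files are parquet
          .option("cloudFiles.schemaLocation", "<path/to/schema>") # required for schema inference
          .load(parquet_path))
    (df.writeStream
       .option("checkpointLocation", "<path/to/checkpoint>")
       .trigger(availableNow=True)                                 # process available files, then stop
       .toTable("<target_delta_table>"))
else:
    print("No _SUCCESS file found; skipping this load.")

Note that because AutoLoader tracks processed files in its checkpoint, files skipped in one run will still be picked up by a later run once the _SUCCESS file appears.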
Reference: apache spark - check if file exists
Upvotes: 0