Reputation: 3024
What are the STARTED, COMMITTED, and SUCCESS files that are created in the underlying storage folder when writing/creating a Spark Parquet table? Can there be multiple of those files? If so, what does it mean to have more than one of them?
Thanks.
Upvotes: 8
Views: 6211
Reputation: 861
_SUCCESS, _started_<id>, and _committed_<id> are artifacts of the DBIO transactional commit protocol, which is enabled by default in Databricks.
_SUCCESS: Spark creates this file in the output directory upon successful completion of a write job. Its presence indicates that the job finished executing without errors and that the output was fully written to that location.
For each DBIO transaction, a unique transaction <id> is generated.
_started_<id>: At the start of each transaction, Spark creates an empty _started_<id> file.
_committed_<id>: If a transaction completes successfully, a _committed_<id> file is created.
Each Parquet file that is written also has the transaction <id> in its name. It is very rare, but not impossible, that stale Parquet files from a previous overwrite transaction remain in the directory. In this rare case, the _committed_<id> file can be leveraged to identify the valid / active Parquet files, as sketched below.
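As a minimal PySpark sketch of this, assuming a hypothetical local output path and assuming the _committed_<id> file is plain JSON with "added" and "removed" lists (as shown in the blog post linked below); on a cluster without DBIO, the _started_/_committed_ files simply won't appear:

import json
import os

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical output path; on Databricks this would usually be a
# cloud-storage location (inspect it with dbutils.fs.ls instead of os).
path = "/tmp/dbio_demo"

spark.range(10).write.mode("overwrite").parquet(path)

# With DBIO transactional commit enabled, the directory now holds
# _SUCCESS, _started_<id>, _committed_<id>, and part-*.parquet files
# whose names embed the transaction <id>.
for name in sorted(os.listdir(path)):
    print(name)

# Assuming the _committed_<id> manifest is JSON listing the files the
# transaction added and removed, the valid Parquet files can be recovered:
for name in (n for n in os.listdir(path) if n.startswith("_committed_")):
    with open(os.path.join(path, name)) as f:
        manifest = json.load(f)
    print("added:", manifest.get("added", []))
    print("removed:", manifest.get("removed", []))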
https://www.databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html
Upvotes: 1
Reputation: 156624
Those files are stored there by the DBIO transactional commit protocol. With DBIO transactional commit, metadata files starting with _started_<id> and _committed_<id> accompany data files created by Spark jobs. Generally you shouldn't alter these files directly. Rather, you should use the VACUUM command to clean them up (see the sketch after the link below).
https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html
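As a minimal sketch of that cleanup, assuming a hypothetical directory and an illustrative retention window (the docs above describe VACUUM taking a path and an optional RETAIN <N> HOURS clause):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical directory; VACUUM removes uncommitted files and stale
# DBIO metadata older than the retention window (168 hours here).
spark.sql("VACUUM '/mnt/data/events' RETAIN 168 HOURS")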
Upvotes: 3