Gadam

Reputation: 3024

What are the _STARTED_, _COMMITTED_, and _SUCCESS_ files in a Spark Parquet table?

What are the STARTED, COMMITTED, and SUCCESS files that are created in the underlying storage folder when writing/creating a Spark Parquet table? Can there be multiple of those files? If so, what does it mean to have more than one of them?

Thanks.

Upvotes: 8

Views: 6211

Answers (2)

Matthew Thomas

Reputation: 861

_SUCCESS, _started_<id>, and _committed_<id> are artifacts of the DBIO transactional commit protocol, enabled by default in Databricks.

_SUCCESS: Spark creates this file in the output directory upon successful completion of a write job. Its presence indicates that the job finished executing without errors and that the output was successfully written to the specified location.

For each DBIO transaction, a unique transaction <id> is generated.

_started_<id>: At the start of each transaction, Spark creates an empty _started_<id> file.

_committed_<id>: If a transaction is successful, a _committed_<id> file is created.

Each parquet file that is written will also have the transaction <id> in its name. It is very rare, but not impossible, for stale parquet files from a previous overwrite transaction to remain in the prefix. In that rare case, the _committed_<id> file can be used to identify the valid / active parquet files.
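
For illustration, here is a minimal PySpark sketch of what a single write produces, assuming a Databricks notebook (where `spark` and `dbutils` are predefined) with DBIO transactional commit enabled. The path and the file names shown are placeholders, not real values:

```python
# Minimal sketch, assuming a Databricks notebook (spark / dbutils predefined)
# with DBIO transactional commit enabled. Path and ids are illustrative.
df = spark.range(100)
df.write.mode("overwrite").parquet("/tmp/example_table")

# Listing the output directory shows the commit-protocol markers alongside
# the data files; each data file carries the transaction id in its name:
#   _started_<id>     empty marker written when the transaction begins
#   _committed_<id>   marker written when the transaction commits
#   _SUCCESS          marker written when the whole job finishes cleanly
#   part-*-tid-<id>-*.parquet   the actual data
for f in dbutils.fs.ls("/tmp/example_table"):
    print(f.name)
```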

https://www.databricks.com/blog/2017/05/31/transactional-writes-cloud-storage.html

Upvotes: 1

StriplingWarrior

Reputation: 156624

Those files are stored there by the DBIO transactional commit protocol.

With DBIO transactional commit, metadata files starting with _started_<id> and _committed_<id> accompany data files created by Spark jobs. Generally you shouldn’t alter these files directly. Rather, you should use the VACUUM command to clean them up.
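
For example (a minimal sketch, assuming a Databricks environment where `spark` is predefined; the path is a placeholder for your table's storage location):

```python
# Minimal sketch, assuming a Databricks environment; the path is a
# placeholder for your table's storage location. VACUUM cleans up
# uncommitted files left behind by failed or aborted write jobs.
spark.sql("VACUUM '/tmp/example_table'")

# An optional RETAIN clause controls how old a file must be before it is
# eligible for cleanup, e.g.:
# spark.sql("VACUUM '/tmp/example_table' RETAIN 48 HOURS")
```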

https://docs.databricks.com/spark/latest/spark-sql/dbio-commit.html

Upvotes: 3
