Pavel Orekhov

Reputation: 2194

Why are S3A staging committers faster when overwriting data?

I tested the staging (directory, partitioned) committers versus the magic committer when overwriting data in an S3-compatible object store. For some reason, the staging committers are faster when overwriting data, and their overwrites are, to a certain extent, atomic, unlike the magic committer's. The magic committer seems to take a long time to delete old data, while the staging committers appear to delete it in an instant. How do they do that?

Magic committer behavior when overwriting some directory that has files:

  1. Delete all files (which may take a long time if you have a lot of data)
  2. Write new data

If there's a failure while writing the new data, we lose the old data, so this committer's overwrite mechanism is not atomic.
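To make the pattern concrete, here's a rough boto3 sketch of delete-then-write as I understand it (the bucket and key names are made up):

    import boto3

    s3 = boto3.client("s3")

    # Step 1: delete the existing output, object by object.
    # This is the slow part when the directory tree is large.
    old = s3.list_objects_v2(Bucket="my-bucket", Prefix="warehouse/table/")
    for obj in old.get("Contents", []):
        s3.delete_object(Bucket="my-bucket", Key=obj["Key"])

    # Step 2: write the new output. If the job dies here, the old
    # data is already gone and the new data is incomplete.
    s3.put_object(Bucket="my-bucket",
                  Key="warehouse/table/part-00000.parquet",
                  Body=b"...new bytes...")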

Staging committer behavior when overwriting some directory that has files (how I see it):

  1. It writes data to S3 without completing the writes (using the multipart upload API)
  2. When all tasks of a job are finished and Spark performs the job commit, the staging committers seem to make the old files invisible without deleting them and make the newly uploaded files visible. I say this because I was monitoring a job that used the directory committer: during the job the old files were still there, and at the end of the job I saw them quickly get replaced by the new files.

This seems to be atomic (to a certain extent), because if the Spark job fails halfway, we still have the old data.
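As far as I understand, the key is S3's multipart upload API: parts can be uploaded without the object becoming visible, and only the final "complete" call publishes it. A rough boto3 sketch of what I think is happening (again, the names are made up):

    import boto3

    s3 = boto3.client("s3")
    bucket = "my-bucket"
    key = "warehouse/table/part-00000.parquet"

    # Task side: start the upload and push the parts.
    # Nothing is visible to readers yet.
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    part = s3.upload_part(Bucket=bucket, Key=key,
                          UploadId=upload["UploadId"],
                          PartNumber=1,
                          Body=b"...file bytes...")

    # Job commit: only now does the object appear, one quick call per file.
    s3.complete_multipart_upload(
        Bucket=bucket, Key=key,
        UploadId=upload["UploadId"],
        MultipartUpload={"Parts": [{"ETag": part["ETag"],
                                    "PartNumber": 1}]})

    # On job abort, the pending upload can be discarded instead,
    # and the object never appears:
    # s3.abort_multipart_upload(Bucket=bucket, Key=key,
    #                           UploadId=upload["UploadId"])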

So, my question is, am I right about the staging committer algorithm?

I have never heard of a mechanism for making files invisible, only of the mechanism that makes files visible after uploading them with the multipart upload API.

So, does the reverse also exist? Can we make existing files invisible and set some TTL for them to be deleted later? Or is it some sort of bulk delete that removes many files at once, in parallel?

Why didn't the magic committer adopt this approach?

I tried googling "how to set s3 files to be hidden/invisible" and didn't find anything.

Upvotes: -1

Views: 396

Answers (1)

stevel

Reputation: 13470

There's no atomic "make invisible" operation in S3. There is a bulk delete call which is very fast, but it's non-atomic.
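For illustration, this is the DeleteObjects call (boto3 here, names made up): it takes up to 1000 keys per request, which is why it is fast, but failures are reported per key and nothing is rolled back:

    import boto3

    s3 = boto3.client("s3")
    keys = ["warehouse/table/part-00000.parquet",
            "warehouse/table/part-00001.parquet"]  # hypothetical keys

    response = s3.delete_objects(
        Bucket="my-bucket",
        Delete={"Objects": [{"Key": k} for k in keys]})

    # Partial failure: some keys may be deleted while others fail.
    for err in response.get("Errors", []):
        print(err["Key"], err["Code"])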

The staging committer has a "partitioned overwrite" variant; the whole committer came from Netflix, where it met their need for incremental, in-place updates of tables.
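If partitioned overwrite is what you want, that variant is selected through the s3a committer options. A sketch for Spark (you also need the committer bindings from the spark-hadoop-cloud module on the classpath):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
        # Use the partitioned staging committer...
        .config("spark.hadoop.fs.s3a.committer.name", "partitioned")
        # ...and replace only the partitions this job writes into.
        .config("spark.hadoop.fs.s3a.committer.staging.conflict-mode",
                "replace")
        .getOrCreate())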

The magic committer was developed in parallel elsewhere and mimics the behaviour of the classic committer. There is some ongoing work (August 2023) to add more of the INSERT OVERWRITE feature you are looking for, something which may need matching changes in Spark.

Upvotes: 1
