Reputation: 2194
I tested the staging committers (directory and partitioned) against the magic committer when overwriting data in an S3-compatible object store. For some reason, the staging committers are faster when overwriting, and their overwrites are atomic to a certain extent, unlike the magic committer's. The magic committer seems to take a long time to delete old data, whereas the staging committers seem to delete it in an instant. How do they do that?
Magic committer behavior when overwriting some directory that has files:
If there's a failure when writing new data, we lose the old data; hence, this committer's overwrite mechanism is not atomic.
Staging committer behavior when overwriting some directory that has files (how I see it):
This seems to be atomic (to a certain extent), because if a Spark job fails halfway, we still have the old data.
So, my question is, am I right about the staging committer algorithm?
I have never heard of a mechanism to make files invisible. I have only heard of a mechanism to make files visible after uploading them with the multipart upload API.
So, does the reverse also exist? Can we make existing files invisible and set a TTL so that they are deleted later? Or is it some sort of bulk delete approach that deletes multiple files at once in parallel?
Why didn't the magic committer adopt this approach?
I tried googling "how to set s3 files to be hidden/invisible" and didn't find anything.
Upvotes: -1
Views: 396
Reputation: 13470
There's no atomic "make invisible" option in S3. There is a bulk delete call which is very fast, but it's non-atomic.
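To illustrate what a bulk delete looks like from the client side, here is a minimal sketch in Python. It assumes a boto3-style client object exposing delete_objects; the helper name bulk_delete and the batching constant are hypothetical, and this is not the code the committers themselves run. S3 accepts at most 1000 keys per DeleteObjects request, so a large directory becomes several requests, and each request can partially fail, which is why the overall operation is not atomic.

```python
from typing import Iterable, List

# S3's DeleteObjects call accepts at most 1000 keys per request.
MAX_KEYS_PER_REQUEST = 1000


def bulk_delete(client, bucket: str, keys: Iterable[str]) -> List[dict]:
    """Delete keys in batches and return any per-key errors the store reports.

    `client` is assumed to be a boto3-style S3 client (hypothetical wiring).
    """
    keys = list(keys)
    errors: List[dict] = []
    for start in range(0, len(keys), MAX_KEYS_PER_REQUEST):
        batch = keys[start:start + MAX_KEYS_PER_REQUEST]
        response = client.delete_objects(
            Bucket=bucket,
            Delete={"Objects": [{"Key": k} for k in batch], "Quiet": True},
        )
        # Each request may succeed for some keys and fail for others.
        errors.extend(response.get("Errors", []))
    return errors
```

With a real client this would be called as bulk_delete(boto3.client("s3"), "my-bucket", keys), where the bucket name is a placeholder. Between batches, a reader listing the directory can see a mix of deleted and not-yet-deleted files.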
The staging committer has a "partitioned overwrite" variant; the whole committer came from Netflix, and it met their needs for incremental updates of in-place tables.
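As a rough sketch of how that variant is selected, these are the settings I would expect for a Spark job, assuming the Hadoop S3A committers and the spark-hadoop-cloud module are on the classpath. The dict name committer_conf is only for illustration; check the values against the Hadoop and Spark versions you actually run.

```python
# Assumed setup: Hadoop S3A committers plus the spark-hadoop-cloud module.
committer_conf = {
    # Pick the partitioned staging committer ("directory" and "magic" are the others).
    "spark.hadoop.fs.s3a.committer.name": "partitioned",
    # On conflict, replace the contents of the partitions being written to.
    "spark.hadoop.fs.s3a.committer.staging.conflict-mode": "replace",
    # Route Spark's commit protocol through the Hadoop path output committers.
    "spark.sql.sources.commitProtocolClass":
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol",
    "spark.sql.parquet.output.committer.class":
        "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter",
}
```

Each entry can be passed to SparkSession.builder.config(key, value) before the session is created. With conflict-mode set to replace, only the partitions that receive new data have their old files deleted, and that happens at job commit rather than before the write starts.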
The magic committer was developed in parallel elsewhere and mimics the behaviour of the classic committer. There is some ongoing work (August 2023) to add more of the INSERT OVERWRITE feature you are looking for, something which may need matching changes in Spark.
Upvotes: 1