derek

Reputation: 10227

How does Spark write to S3 or Azure Blob Storage distributively?

When we use Spark to write out files to AWS S3 or Azure Blob Storage, we can simply write:

df.write.parquet("/online/path/folder")

then the contents will be written to hundreds of files under the specified folder, like this:

/online/path/folder/f-1
/online/path/folder/f-2
...
/online/path/folder/f-100

My question is: since the write is executed on tens or hundreds of Spark executors simultaneously, how do they avoid writing to the same file? Another important question: what if some executor fails and is restarted? Will the restarted executor write to the same file it was writing before it failed?

Upvotes: 0

Views: 335

Answers (1)

falcon-le0

Reputation: 609

Spark adds a UUID, the partition number, and other task-related information to the file name, so it guarantees that file names are unique across all executors and tasks.

part-00000-a4ec413d-cb30-4103-afe1-410c11a164e8-c000.snappy.parquet
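For illustration, a name like the one above can be broken into its components. The split logic and the labels in the comments below are my own explanatory assumptions, not Spark API:

    # Illustrative only: decompose a Spark part-file name into its pieces.
    name = "part-00000-a4ec413d-cb30-4103-afe1-410c11a164e8-c000.snappy.parquet"

    base, codec, ext = name.rsplit(".", 2)
    prefix, partition, *uuid_parts, file_counter = base.split("-")

    print(prefix)                # "part"
    print(partition)             # "00000"  -> task/partition number
    print("-".join(uuid_parts))  # "a4ec413d-cb30-4103-afe1-410c11a164e8" -> write-job UUID
    print(file_counter)          # "c000"   -> file counter within the task
    print(codec, ext)            # "snappy", "parquet"

Because every task embeds its own partition number and the job-level UUID, no two tasks can produce the same file name, even across retries.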

Similar question here: Could anyone please explain what c000 means in c000.snappy.parquet or c000.snappy.orc?

By default, Spark writes files to a temporary folder and waits for all reducers to complete; then it executes a commit-job operation that moves all files to the destination folder. Thus, in case of failure, Spark can safely start a new executor to re-run the failed tasks and rewrite their results.
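As a rough sketch of what this looks like in practice (the app name and output path are placeholders; the committer setting is the standard Hadoop `mapreduce.fileoutputcommitter.algorithm.version` key passed through `spark.hadoop.*`):

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("commit-protocol-demo")
        # Hadoop FileOutputCommitter: v1 moves task output at task commit and again at
        # job commit; v2 moves each task's files to the destination as soon as the
        # task commits (faster on S3, but partial output is visible on failure).
        .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
        .getOrCreate()
    )

    df = spark.range(1000)

    # While the job runs, files live under a hidden staging area such as
    #   /online/path/folder/_temporary/0/_temporary/attempt_.../part-00000-<uuid>-c000.snappy.parquet
    # and are only moved into /online/path/folder when the job commits, so a failed
    # and re-run task never leaves a half-written file visible to readers.
    df.write.mode("overwrite").parquet("/online/path/folder")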

Upvotes: 2
