Reputation: 1358
With a preexisting dataset
s3://data/id=1/file.parquet
s3://data/id=2/file.parquet
And an incoming dataframe
/data/id=3/
If the incoming data is written with SaveMode.Append
df.write.partitionBy("id").mode(SaveMode.Append).parquet("s3://data/")
What, if any, data in the preexisting dataset will be copied to the temp directory that is created?
Upvotes: 5
Views: 6471
Reputation: 5848
I have a very similar use case in my Spark application, but I'm not sure exactly what your question is, so I'll try to answer generally.
When you write the id=3 data the way you suggested, the existing data stays the same and the new data is appended to s3://data/ under s3://data/id=3/.
When using partitionBy, the path you pass is the base path, so if you had used overwrite mode the existing files (s3://data/id=1/, s3://data/id=2/) would have been deleted.
Since this is an append, there is no conflict.
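To make the difference concrete, here is a minimal sketch (assuming a DataFrame df that contains only the id=3 rows and a SparkSession named spark; paths are illustrative):

import org.apache.spark.sql.SaveMode

// Append: only creates s3://data/id=3/; id=1 and id=2 are left untouched
df.write.partitionBy("id").mode(SaveMode.Append).parquet("s3://data/")

// Overwrite (default behavior): clears everything under the base path first,
// so s3://data/id=1/ and s3://data/id=2/ would be removed as well
df.write.partitionBy("id").mode(SaveMode.Overwrite).parquet("s3://data/")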
You asked about the temp directory: if you mean the _temporary directory the Hadoop output committer uses to stage files before committing them, then only the files belonging to the current write appear there while the job is running; none of the preexisting data is copied into it. Once the job commits, the staged files are moved to their final locations and _temporary is deleted.
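For illustration only, while the job is in flight the staged files would sit under something roughly like this (assuming the default Hadoop FileOutputCommitter; the attempt ID and file name below are made up):

s3://data/_temporary/0/_temporary/attempt_..._m_000000_0/id=3/part-00000-....snappy.parquet

Note that the staged layout mirrors only the partitions being written (id=3 here), never the preexisting ones.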
Upvotes: 5