Jennifer Stone

Reputation: 39

Write new data into an existing Parquet file with append write mode

I am using the code snippet below to save the data. It only creates a new Parquet file under the same partition folder. Is there any way to truly append the data to the existing Parquet file, so we won't end up having multiple files if there are many appends in a day?

df.coalesce(1).write.mode('append').partitionBy('partitionKey').parquet(r'...\parquet_file_folder')

Thank you so much for your help.

Upvotes: 1

Views: 3216

Answers (1)

Adam Dukkon

Reputation: 293

See the answer here: How can I append to same file in HDFS(spark 2.11)

"Append in Spark means write-to-existing-directory, not append-to-file.

This is intentional and desired behavior (think what would happen if the process failed in the middle of "appending", even if the format and file system allow that).

Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy which, for obvious reasons, is not desired on a batch-to-batch basis."
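In PySpark terms, that separate compaction process amounts to reading the partition folder back and rewriting it as one file (e.g. `spark.read.parquet(...).coalesce(1).write...`), never appending to a part file in place. Parquet part files cannot be concatenated byte-for-byte, so below is a minimal stdlib-only sketch of the same read-all / rewrite-to-temp / atomic-swap pattern using JSON-lines part files; the `compact_partition` helper and the `part-*.jsonl` naming are illustrative assumptions, not Spark APIs.

```python
import os
import shutil
import tempfile

def compact_partition(partition_dir: str) -> str:
    """Merge every part-*.jsonl file in partition_dir into a single
    part file. The merged file is built in a temp directory first and
    swapped in afterwards, so a crash mid-compaction never leaves the
    partition half-written -- the fault-tolerance point the quoted
    answer makes. (Real Parquet compaction would instead re-read and
    rewrite the data with Spark itself.)"""
    part_files = sorted(
        f for f in os.listdir(partition_dir) if f.startswith("part-")
    )
    # Build the replacement directory next to the original so the
    # final os.rename stays on one filesystem (illustrative choice).
    tmp_dir = tempfile.mkdtemp(
        prefix="compaction-", dir=os.path.dirname(partition_dir) or "."
    )
    merged = os.path.join(tmp_dir, "part-00000.jsonl")
    with open(merged, "w") as out:
        for name in part_files:
            with open(os.path.join(partition_dir, name)) as src:
                shutil.copyfileobj(src, out)
    # Swap only after the full rewrite succeeded, then drop the old copy.
    backup = partition_dir + ".old"
    os.rename(partition_dir, backup)
    os.rename(tmp_dir, partition_dir)
    shutil.rmtree(backup)
    return os.path.join(partition_dir, "part-00000.jsonl")
```

Note the full copy: every part file is read and rewritten, which is exactly why the answer recommends running compaction occasionally rather than on every batch.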

Upvotes: 4
