Aniket Kumar

Reputation: 324

Update existing records of parquet file in Azure

I am converting a SQL Server table into parquet file format using Azure Data Factory, and querying the parquet files with Databricks for reporting. I want to update only those records that have changed in the original SQL Server table. Since the table is very big and this runs daily, I don't want to truncate and reload the entire table, as that would be costly.

Is there any way I can update those parquet files without performing a truncate-and-reload operation?

Upvotes: 1

Views: 5395

Answers (3)

Deeraj Kumar

Reputation: 1

Always go for a soft delete while working in NoSQL; a hard delete is very costly.

Also, with a soft delete, downstream pipelines can consume the update and act upon it.
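For illustration, a minimal sketch of the soft-delete idea in Spark SQL; the table, flag column, and key (`my_table`, `is_deleted`, `id`) are hypothetical, and the in-place UPDATE assumes the table is stored in Delta format, since plain parquet is immutable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mark the record as deleted instead of physically removing it
# (UPDATE requires a Delta table; plain parquet cannot be modified).
spark.sql("UPDATE my_table SET is_deleted = true WHERE id = 42")

# Downstream consumers filter on the flag and can react to the change.
spark.sql("SELECT * FROM my_table WHERE NOT is_deleted")
```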

Upvotes: -1

Aniket Kumar

Reputation: 324

I have found a workaround for this problem (a sketch follows the steps below):

  1. Read the parquet files into a data frame using any tool or Python script.
  2. Create a temporary table or view from the data frame.
  3. Run SQL queries to modify, update, or delete records.
  4. Convert the table back into a data frame.
  5. Overwrite the existing parquet files with the new data.
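Here is a minimal PySpark sketch of these steps; the path, view name, and columns (`id`, `status`) are hypothetical placeholders, and the concrete update logic will differ:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-update").getOrCreate()

path = "/mnt/data/my_table"  # hypothetical parquet location

# 1. Read the parquet files into a data frame.
df = spark.read.parquet(path)

# 2. Create a temporary view from the data frame.
df.createOrReplaceTempView("my_table")

# 3. Temp views over parquet are read-only, so express the update and
#    delete as a SELECT that produces the corrected rows.
updated = spark.sql("""
    SELECT id,
           CASE WHEN id = 42 THEN 'closed' ELSE status END AS status
    FROM my_table
    WHERE id <> 99   -- drops the record to delete
""")

# 4./5. Spark refuses to overwrite a path it is still reading from, so
#    materialize the result first, then overwrite the original files.
updated.cache().count()
updated.write.mode("overwrite").parquet(path)
```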

Upvotes: 1

Alex Ott

Reputation: 87164

Parquet is immutable by default, so the only way to change the data is to rewrite the table. But this becomes possible if you switch to the Delta file format, which supports updating and deleting entries, and also supports the MERGE operation.

You can still use the Parquet format to produce the data, but you then need to use that data to update the Delta table.
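A minimal sketch of that pattern on Databricks with PySpark; the paths and the join key (`id`) are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Incremental parquet extract produced by Data Factory (hypothetical path).
updates = spark.read.parquet("/mnt/staging/my_table_increment")

# The reporting table, stored in Delta format (hypothetical path).
target = DeltaTable.forPath(spark, "/mnt/delta/my_table")

# MERGE: update rows that already exist, insert the new ones.
(target.alias("t")
       .merge(updates.alias("u"), "t.id = u.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```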

Upvotes: 2
