Aniket Kumar

Reputation: 324

Update existing records of parquet file in Azure

I am converting a SQL Server table into parquet file format using Azure Data Factory, and querying the parquet files with Databricks for reporting. I want to update only those records that have changed in the original SQL Server table. Since the table is very big and this runs daily, I don't want to truncate and reload the entire table, as that would be costly.

Is there any way I can update those parquet files without performing a truncate-and-reload operation?

Upvotes: 1

Views: 5395

Answers (3)

Deeraj Kumar

Reputation: 1

Always go for a soft delete while working in NoSQL; a hard delete is very costly.

Also, with a soft delete, downstream pipelines can consume the update and act upon it.
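For illustration, a minimal sketch of the soft-delete idea in Spark SQL; the table, flag column, and key (`my_table`, `is_deleted`, `id`) are hypothetical, and the in-place UPDATE assumes the table is stored in Delta format, since plain parquet is immutable:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Mark the record as deleted instead of physically removing it
# (UPDATE requires a Delta table; plain parquet cannot be modified).
spark.sql("UPDATE my_table SET is_deleted = true WHERE id = 42")

# Downstream consumers filter on the flag and can react to the change.
spark.sql("SELECT * FROM my_table WHERE NOT is_deleted")
```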

Upvotes: -1

Aniket Kumar

Reputation: 324

I have found a workaround for this problem (a sketch follows the steps below):

  1. Read the parquet files into a data frame using any tool or Python script.
  2. Create a temporary table or view from the data frame.
  3. Run SQL queries to modify, update, or delete records.
  4. Convert the table back into a data frame.
  5. Overwrite the existing parquet files with the new data.
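Here is a minimal PySpark sketch of these steps; the path, view name, and columns (`id`, `status`) are hypothetical placeholders, and the concrete update logic will differ:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-update").getOrCreate()

path = "/mnt/data/my_table"  # hypothetical parquet location

# 1. Read the parquet files into a data frame.
df = spark.read.parquet(path)

# 2. Create a temporary view from the data frame.
df.createOrReplaceTempView("my_table")

# 3. Temp views over parquet are read-only, so express the update and
#    delete as a SELECT that produces the corrected rows.
updated = spark.sql("""
    SELECT id,
           CASE WHEN id = 42 THEN 'closed' ELSE status END AS status
    FROM my_table
    WHERE id <> 99   -- drops the record to delete
""")

# 4./5. Spark refuses to overwrite a path it is still reading from, so
#    materialize the result first, then overwrite the original files.
updated.cache().count()
updated.write.mode("overwrite").parquet(path)
```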

Upvotes: 1

Alex Ott

Reputation: 87164

Parquet is immutable by default, so the only way to change the data is to rewrite the table. But this becomes possible if you switch to the Delta file format, which supports updating and deleting entries, and also supports the MERGE operation.

You can still use the Parquet format to produce the data, but you then need to use that data to update the Delta table.
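A minimal sketch of that pattern on Databricks with PySpark; the paths and the join key (`id`) are hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Incremental parquet extract produced by Data Factory (hypothetical path).
updates = spark.read.parquet("/mnt/staging/my_table_increment")

# The reporting table, stored in Delta format (hypothetical path).
target = DeltaTable.forPath(spark, "/mnt/delta/my_table")

# MERGE: update rows that already exist, insert the new ones.
(target.alias("t")
       .merge(updates.alias("u"), "t.id = u.id")
       .whenMatchedUpdateAll()
       .whenNotMatchedInsertAll()
       .execute())
```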

Upvotes: 2
