user2197446

Reputation: 1146

Delete rows from a parquet table using Pyspark using storage account trigger

I'm trying to figure out a pattern for deleting rows from a table in my data lake. The data lake is Azure Data Lake Gen2, and I'm using a Synapse Notebook writing code in PySpark. I have a service bus that gets messages when an object has an insert or update, which I'm processing in two steps:

  1. A service listens to the service bus and drops the message into my Bronze storage account as a parquet file with a JSON structure.
  2. When that file appears in the Bronze file path, it fires a storage account trigger in Synapse that runs a pipeline, which runs a Synapse Notebook to process the insert/update into the Silver layer. The Silver layer is an auto-partitioned parquet file in the Azure data lake.

Sometimes rows get deleted, and I'm sure I could create a message with the ObjectId and a message type of delete, then drop that message in a Bronze file path. What I'm looking for is: how would I then delete the row from the partitioned parquet file in the Silver layer?

Upvotes: 0

Views: 900

Answers (1)

boyangeor

Reputation: 1151

There is no easy way to delete rows from parquet files, and all of the workarounds bring their own issues. This is why formats like Delta, Hudi, and Iceberg were created: they support DELETE and ensure the ACID properties of your operation. Here is a good start for Synapse + Delta: link.
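For illustration, here is a minimal sketch of such a delete, assuming the Silver table has already been converted to Delta format and has an ObjectId column. The paths and names below are placeholders, not your actual layout:

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Placeholder paths -- substitute your real Bronze/Silver locations.
bronze_path = "abfss://bronze@<account>.dfs.core.windows.net/messages/delete_msg.parquet"
silver_path = "abfss://silver@<account>.dfs.core.windows.net/my_table"

# Read the delete message that the storage account trigger fired on.
msg = spark.read.parquet(bronze_path)
object_ids = [row["ObjectId"] for row in msg.select("ObjectId").collect()]

# The Silver data must be stored as Delta; plain parquet files
# do not support DELETE.
silver = DeltaTable.forPath(spark, silver_path)

# ACID delete of the matching rows.
silver.delete(F.col("ObjectId").isin(object_ids))
```

If the existing Silver files are plain parquet, Delta Lake's `CONVERT TO DELTA` SQL command can convert them in place (for partitioned data you must supply the partition schema), after which the Delta APIs above apply.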

Upvotes: 2
