Reputation: 1146
I'm trying to figure out a pattern for deleting rows from a table in my data lake. The data lake is Azure Data Lake Storage Gen2, and I'm writing PySpark code in a Synapse notebook. I have a service bus that gets a message whenever an object is inserted or updated, which I'm processing in two steps:
Sometimes rows get deleted, and I'm sure I could create a message with the ObjectId and a message type of delete, then drop that message in a Bronze file path. What I'm looking for is how I would then delete the row from the partitioned parquet file in the Silver layer.
Upvotes: 0
Views: 900
Reputation: 1151
There is no easy way to delete rows from parquet files, and all "solutions" bring many issues with them. This is why formats like delta, hudi, and iceberg have been created. They support DELETE and ensure ACID properties of your operation. Here is a good start for Synapse + Delta: link.
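To make the Delta route concrete, here is a minimal sketch of applying delete messages to a Silver table with a Delta merge. It assumes the Silver table is already stored as Delta and that the Bronze delete messages are JSON with `ObjectId` and `MessageType` fields. The `MessageType` column name, both paths, and the function names `apply_deletes` and `run_in_synapse` are hypothetical, so adjust them to your layout.

```python
def apply_deletes(silver_table, deletes_df, key="ObjectId"):
    """Delete every Silver row whose key appears in deletes_df.

    silver_table is expected to be a delta.tables.DeltaTable and
    deletes_df a Spark DataFrame containing the key column.
    """
    (
        silver_table.alias("t")
        .merge(deletes_df.alias("d"), f"t.{key} = d.{key}")
        .whenMatchedDelete()  # matched rows are removed in one ACID commit
        .execute()
    )


def run_in_synapse(spark, bronze_path, silver_path):
    """Hypothetical wiring for a Synapse notebook (paths are placeholders)."""
    # Imported lazily so the helper above can be used without Delta installed.
    from delta.tables import DeltaTable

    # Assumption: delete messages land in Bronze as JSON files that carry
    # an ObjectId and a MessageType of 'delete'.
    deletes_df = (
        spark.read.json(bronze_path)
        .where("MessageType = 'delete'")
        .select("ObjectId")
        .distinct()
    )
    apply_deletes(DeltaTable.forPath(spark, silver_path), deletes_df)
```

In a notebook you would call `run_in_synapse(spark, "<bronze delete path>", "<silver table path>")` with your own `abfss://` locations. If the Silver layer is still plain parquet, `DeltaTable.convertToDelta` can convert it in place first; for a partitioned table it also needs the partition schema.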
Upvotes: 2