dkapitan

Reputation: 931

Is there a way to delete a record from all versions of a Delta Lake table?

We are investigating how to implement the GDPR 'right to be forgotten' in Delta Lake. The key requirement is to delete a record (belonging to a person who has requested that their data be removed) from Delta Lake, including all previous versions.

I thought (hoped) that VACUUM would do the trick, but as I understand it, VACUUM deletes whole tables. Hence, I lose the history of all other records, which I would like to keep.

Here is a notebook demonstrating what I want to do.

Upvotes: 3

Views: 1589

Answers (1)

Alex Ott

Reputation: 87299

Versions in Delta tables are immutable: a modification doesn't change the existing data files, but reads the original data, applies the change, and writes new files as a new version. Because of that, you need to modify the data (delete the record) and then clean up the old versions with VACUUM. Databricks has a very good guide on handling GDPR and CCPA data with Delta Lake that describes how to approach this problem.
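As a rough illustration of the delete-then-VACUUM flow, here is a minimal sketch using the `delta-spark` Python API (`DeltaTable.delete` and `DeltaTable.vacuum`). The helper name `forget_user`, the table path, and the `customer_id` column are hypothetical, and it assumes a Spark session that already has Delta Lake configured (as on Databricks).

```python
def forget_user(delta_table, predicate, retention_hours=None):
    """Delete matching rows, then vacuum files that are no longer referenced.

    `delta_table` is expected to be a delta.tables.DeltaTable.
    """
    # Writes a new table version without the matching rows. The old data
    # files still exist on storage and stay reachable through time travel.
    delta_table.delete(predicate)

    # Physically removes data files that are no longer referenced by the
    # current version and are older than the retention period.
    if retention_hours is None:
        delta_table.vacuum()  # uses the table's default retention
    else:
        delta_table.vacuum(retention_hours)


def main():
    # Imported here so that forget_user can be read without Spark installed.
    from pyspark.sql import SparkSession
    from delta.tables import DeltaTable

    spark = SparkSession.builder.getOrCreate()  # assumes Delta is configured
    table = DeltaTable.forPath(spark, "/mnt/lake/customers")  # hypothetical path
    forget_user(table, "customer_id = 'c-123'")  # hypothetical column and value
```

Note that VACUUM only removes files older than the retention period (7 days by default), so the deleted record stays in the old files until then. Vacuuming with a shorter retention requires disabling the `spark.databricks.delta.retentionDurationCheck.enabled` safety check, and time travel to versions that depended on the removed files stops working.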

Theoretically, you could write a script that goes through the whole history, reads each version, modifies the data, and writes it as a new version, then runs VACUUM at the end, but that could be quite resource-intensive.

Also, if you need to perform this operation periodically, you may want to consider other approaches, such as encrypting each user's data with an individual key or separating the PII into its own table that you can modify, among others.
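To make the "separate PII table" idea a bit more concrete, here is a storage-agnostic sketch in plain Python. The `PiiVault` class and its methods are hypothetical stand-ins for a small, mutable PII table. The large, versioned table stores only a random surrogate id, so forgetting a person means deleting one row in the small table rather than rewriting the history of the large one.

```python
import uuid


class PiiVault:
    """Stand-in for a small, mutable PII table keyed by a surrogate id."""

    def __init__(self):
        self._by_id = {}

    def register(self, pii):
        # A random id reveals nothing about the person it refers to.
        surrogate = uuid.uuid4().hex
        self._by_id[surrogate] = dict(pii)
        return surrogate

    def lookup(self, surrogate):
        # Returns None once the person has been forgotten.
        return self._by_id.get(surrogate)

    def forget(self, surrogate):
        # Returns True if a record was actually removed.
        return self._by_id.pop(surrogate, None) is not None


vault = PiiVault()
sid = vault.register({"name": "Jane Doe", "email": "jane@example.com"})

# The large, versioned table holds only the surrogate id, never the PII.
events = [{"user": sid, "action": "login"}, {"user": sid, "action": "purchase"}]

vault.forget(sid)  # the events remain, but no longer resolve to a person
```

If the PII table is itself a Delta table, it still needs the delete and VACUUM treatment, but it is far smaller than the main table and cheaper to rewrite.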

Upvotes: 1
