Reputation: 549
I need to delete a Delta Lake partition, along with its underlying AWS S3 files, and then make sure AWS Athena reflects the change. The reason is that I need to rerun some code to re-populate the data.
I tried this
from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, path)
deltaTable.delete("extract_date = '2022-03-01'")  # extract_date is the partition column
It completed with no errors, but the files still exist on S3 and Athena still shows the data, even after running MSCK REPAIR TABLE after the delete. Can someone advise the best way to delete partitions and update Athena?
Upvotes: 2
Views: 3985
Reputation: 1
From my observation, VACUUM doesn't delete the S3 files. I ran VACUUM with the default retention (7 days), and I still see the Parquet files on S3 even after 7 days have elapsed since the command was run.
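One way to check whether VACUUM would actually remove anything is Delta's DRY RUN option, which lists the candidate files without deleting them. This is a hedged sketch, not from the post; `path` is a placeholder table location, and the returned statement would be passed to `spark.sql(...)` on a Delta-enabled SparkSession:

```python
def vacuum_dry_run_sql(path, retain_hours=168):
    # Build a VACUUM ... DRY RUN statement for the Delta table at `path`.
    # DRY RUN lists the files eligible for deletion but removes nothing,
    # so it shows whether any files are old enough to be vacuumed yet.
    return f"VACUUM delta.`{path}` RETAIN {retain_hours} HOURS DRY RUN"
```

If the dry run returns no files, the data files are still within the retention window, which would explain them surviving on S3.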
Upvotes: 0
Reputation: 195
To add to Alex's answer: if you want to shorten the retention period to less than 7 days, you have to set the configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
From the original docs:
Delta Lake has a safety check to prevent you from running a dangerous VACUUM command. If you are certain that there are no operations being performed on this table that take longer than the retention interval you plan to specify, you can turn off this safety check by setting the Spark configuration property spark.databricks.delta.retentionDurationCheck.enabled to false.
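Putting the quoted docs into code, a minimal sketch might look like this. The `spark` session, table `path`, and the helper names are assumptions for illustration, not from the post; the config key and the 7-day (168-hour) default are from the Delta docs quoted above:

```python
DEFAULT_RETENTION_HOURS = 7 * 24  # Delta's default retention: 7 days

def needs_check_disabled(retention_hours):
    # The safety check only trips when the requested retention is
    # shorter than the 7-day default.
    return retention_hours < DEFAULT_RETENTION_HOURS

def vacuum_with_retention(spark, path, retention_hours):
    # Hypothetical helper: vacuum a Delta table, disabling the safety
    # check first when the retention is below the default.
    from delta.tables import DeltaTable  # provided by the delta-spark package
    if needs_check_disabled(retention_hours):
        spark.conf.set(
            "spark.databricks.delta.retentionDurationCheck.enabled", "false")
    DeltaTable.forPath(spark, path).vacuum(retention_hours)
```

Only disable the check if you are certain no concurrent writers or readers need the older file versions, as the docs warn.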
Upvotes: 0
Reputation: 87069
Although you performed a delete operation, the data is still there because Delta tables keep history, and actual deletion of the data files happens only when you execute the VACUUM operation and the files are older than the default retention period (7 days). If you want to remove data faster, you can run the VACUUM command with the parameter RETAIN XXX HOURS, but this may require setting some additional properties to enforce that; refer to the documentation for more details.
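The delete-then-vacuum flow described above can be sketched as follows. This is illustrative, not a definitive recipe: `spark`, `path`, and the helper name are placeholders, and `vacuum(0)` assumes the retention safety check has been disabled (see the other answers):

```python
def partition_predicate(column, value):
    # Hypothetical helper building the delete predicate string,
    # e.g. "extract_date = '2022-03-01'".
    return f"{column} = '{value}'"

def delete_partition_and_purge(spark, path, extract_date):
    # Illustrative sketch: logically delete the partition, then
    # physically remove the unreferenced files with VACUUM.
    from delta.tables import DeltaTable  # provided by the delta-spark package
    table = DeltaTable.forPath(spark, path)
    table.delete(partition_predicate("extract_date", extract_date))  # logical delete
    # Disable the safety check so retention below 7 days is allowed,
    # then keep 0 hours of history: all files no longer referenced by
    # the latest table version are removed from S3.
    spark.conf.set(
        "spark.databricks.delta.retentionDurationCheck.enabled", "false")
    table.vacuum(0)
```

After the files are gone you would still need Athena to pick up the change; for a partition that is repopulated in place, re-running your load and refreshing the table metadata (e.g. MSCK REPAIR TABLE, as you tried) is the usual follow-up.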
Upvotes: 3