Reputation: 217
I'm a little confused about the transaction log of Delta Lake. The documentation says the default retention policy is 30 days, and that it can be modified via the property delta.logRetentionDuration=<interval-string>.
But I don't understand when the log files are actually deleted from the _delta_log folder. Is it when we run some operation, maybe the VACUUM operation? However, it is mentioned that VACUUM only deletes data files, not logs. So will anything delete logs older than the specified log retention duration?
Reference: https://docs.databricks.com/delta/delta-batch.html#data-retention
Upvotes: 6
Views: 8630
Reputation: 10693
The value of the option is an interval literal. There is no way to specify a literal infinity, and months and years are not allowed for this particular option (for a reason). However, nothing stops you from saying interval 1000000000 weeks, which is roughly 19 million years and therefore effectively infinite.
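A quick sanity check on the "effectively infinite" claim (plain arithmetic, nothing Delta-specific):

```python
# Convert "interval 1000000000 weeks" into years to see how large it really is.
weeks = 1_000_000_000
years = weeks * 7 / 365.25  # 365.25 accounts for leap years
print(f"{years:,.0f} years")  # roughly 19 million years
```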
Upvotes: 2
Reputation: 461
By default, the reference implementation creates a checkpoint every 10 commits.
There is an async process that runs for every 10th commit to the _delta_log folder. It creates a checkpoint file and cleans up the .crc and .json files that are older than delta.logRetentionDuration.
Checkpoints.scala has checkpoint > checkpointAndCleanupDeltaLog > doLogCleanup. MetadataCleanup.scala has doLogCleanup > cleanUpExpiredLogs.
Upvotes: 3