Reputation: 262
I created a simple Synapse Delta Lake table via:
CREATE TABLE IF NOT EXISTS db1.tbl1 (
id INT NOT NULL,
name STRING NOT NULL
)
USING DELTA
I've merged rows of data into it multiple times (a sketch of the merge pattern follows the output below), such that I now see a total of 15 Parquet files in the underlying ADLS Gen2 storage account container. If I select the rows, I see my most recent merge as expected, e.g.:
+---+-------+
| id| name|
+---+-------+
| 1| adam|
| 2| bob|
| 3|charles|
+---+-------+
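The merges followed this general pattern (db1.updates is a placeholder name standing in for whatever staging source was used on each run):
MERGE INTO db1.tbl1 AS t
USING db1.updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.name = s.name
WHEN NOT MATCHED THEN INSERT (id, name) VALUES (s.id, s.name)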
However, when I run:
OPTIMIZE db1.tbl1
I still see the 15 files. Shouldn't it have compacted them, significantly reducing the total file count?
Upvotes: 1
Views: 1281
Reputation: 3620
VACUUM would be a more suitable command for your need. OPTIMIZE is for read performance: it compacts many small files into fewer larger ones, but it only writes new files and marks the old ones as removed in the transaction log; the old files remain in storage to support time travel. VACUUM is what physically removes those old data files. The default retention for VACUUM is 7 days. If you want to vacuum more recent data files, you can specify a shorter window with the RETAIN parameter, at the cost of shortening the time travel capability. Example, vacuuming everything older than 1 hour:
VACUUM db1.tbl1 RETAIN 1 HOURS
Optionally, you can dry run the command first; this lists the files that would be deleted without actually removing them:
VACUUM db1.tbl1 RETAIN 1 HOURS DRY RUN
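Note that by default Delta refuses a retention interval shorter than 7 days as a safety check. If you're certain no readers or jobs still need the old files, you can disable the check for the session before vacuuming (a sketch, assuming Synapse Spark honors the standard Delta Lake config):
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
VACUUM db1.tbl1 RETAIN 1 HOURS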
More info on the VACUUM command here: https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-vacuum
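As a sanity check that OPTIMIZE did compact the table, you can compare the file count the transaction log reports against what's physically in storage; the numFiles column from DESCRIBE DETAIL counts only the files the current table version references, not the stale ones still waiting to be vacuumed:
DESCRIBE DETAIL db1.tbl1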
Upvotes: 0