siddarfer

Reputation: 262

optimizing synapse delta lake table not reducing the number of files

I created a simple synapse delta lake table via:

CREATE TABLE IF NOT EXISTS db1.tbl1 (
id INT NOT NULL,
name STRING NOT NULL
)
USING DELTA

I've merged rows of data into it multiple times such that I now see a total of 15 parquet files in the underlying ADLS gen 2 storage account container. If I select the rows, I see my most recent merge as expected, e.g.:

+---+-------+
| id|   name|
+---+-------+
|  1|   adam|
|  2|    bob|
|  3|charles|
+---+-------+

However, when I run:

OPTIMIZE db1.tbl1

I still see the 15 files. Shouldn't OPTIMIZE have compacted the files, significantly reducing the total file count?

Upvotes: 1

Views: 1281

Answers (1)

Veikko

Reputation: 3620

VACUUM would be the more suitable command for your need. OPTIMIZE compacts small files into larger ones to improve read performance, but it does not delete anything: the old files are kept in storage so Delta's time travel can still reconstruct earlier versions of the table, which is why you still see all 15 files. VACUUM is what removes old data files that the current table version no longer references. By default, VACUUM keeps 7 days of retention. If you want to vacuum more recent data files, you can specify a shorter window with the RETAIN parameter, though this shortens the time travel capability. Example vacuuming everything older than 1 hour:

VACUUM db1.tbl1 RETAIN 1 HOURS
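
Note that Delta will normally refuse a RETAIN value below the default 168 hours (7 days). As a sketch, assuming a Spark session where you can change session configuration (spark.databricks.delta.retentionDurationCheck.enabled is the standard open-source Delta Lake key, but verify it behaves the same in your Synapse runtime):

-- Disable the safety check that rejects retention windows shorter than 7 days
SET spark.databricks.delta.retentionDurationCheck.enabled = false

VACUUM db1.tbl1 RETAIN 1 HOURS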

Optionally, you can do a dry run first to see what would happen before actually running the command; this lists the files that would be deleted:

VACUUM db1.tbl1 RETAIN 1 HOURS DRY RUN
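
To confirm that OPTIMIZE did compact the data and that the extra files in ADLS are just stale versions, you can check what the current table version actually references. A sketch, assuming a Spark SQL session; DESCRIBE DETAIL is standard Delta Lake, and its numFiles column counts only the files in the latest snapshot, not the leftovers still sitting in storage:

-- numFiles reflects the current snapshot only, so it should be small
-- after OPTIMIZE even while the old files remain in the container
DESCRIBE DETAIL db1.tbl1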

More info on the VACUUM command here: https://learn.microsoft.com/en-us/azure/databricks/spark/latest/spark-sql/language-manual/delta-vacuum.

Upvotes: 0
