Reputation: 118
I've been a developer for over a decade, but I'm new to Data Engineering. I set up a couple of Iceberg tables in AWS Glue and S3. I've been replicating my production data to these tables for a couple of weeks (~100k-300k inserts per day) and saw that our S3 storage size was exploding. A little analysis showed that 99% of this storage was metadata. In the worst case, one table had only 13 GB of actual data and 66 TB of metadata (I emptied that bucket pretty quickly). Several other buckets had 200 MB to 2 GB of data and still had 5 TB to 7 TB of metadata.
Is it normal for Iceberg to accumulate metadata so quickly? Or is this just a consequence of having so many inserts on a daily basis?
I tried running the "OPTIMIZE table" query in Athena, which I got from the Athena documentation, but each run only scans about 2 GB and takes 30 minutes, which is way too slow to do by hand on 5 TB.
Upvotes: 4
Views: 2669
Reputation: 2250
How often are you writing to the Iceberg table? Each insert generates new metadata, so it would be better to batch inserts where possible.
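To make the batching suggestion concrete, here is a minimal sketch that groups rows into fixed-size batches so that each batch can become a single write. It assumes the replication currently commits rows one at a time, which the question does not state; the helper name `batched` and the batch size are hypothetical.

```python
from itertools import islice


def batched(rows, batch_size):
    """Yield lists of up to batch_size rows from any iterable of rows."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Each yielded batch would be written with one INSERT, i.e. one Iceberg commit.
for batch in batched(range(5), 2):
    print(len(batch), "rows in this commit")
```

Under that assumption, a batch size of 10,000 would turn 100k-300k daily commits into roughly 10-30.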
After each insert a new snapshot is created. The snapshot will link to the existing data and the new data. Once in a while, you want to run OPTIMIZE, as you already suggested, to rewrite the data. This will compact small Parquet files into bigger ones.
Another job that you need to run periodically is VACUUM. This will expire the old snapshots and remove the data and metadata that only those snapshots reference. Running VACUUM will limit the time travel offered by Iceberg, since old data is being deleted. When a snapshot should expire can be set through a table property.
What's missing here is the actual compaction of metadata. This can be done through Spark using the rewrite_manifests procedure, which combines small manifests. It is best to first run rewrite_data_files, then rewrite_manifests, and then remove_orphan_files. I know that people run Spark inside a lambda on a schedule to maintain their Iceberg tables.
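A minimal sketch of that maintenance order as Spark SQL procedure calls follows. Note that Iceberg's Spark procedure for cleaning up orphan files is named remove_orphan_files. The catalog name `glue_catalog`, the table name, and both function names are hypothetical, and `spark` is assumed to be a SparkSession already configured with the Iceberg runtime and SQL extensions.

```python
def spark_maintenance_calls(catalog, table):
    """Return the Iceberg procedure calls in the recommended order."""
    return [
        # 1. Compact small data files.
        f"CALL {catalog}.system.rewrite_data_files(table => '{table}')",
        # 2. Combine small manifests (the metadata compaction).
        f"CALL {catalog}.system.rewrite_manifests(table => '{table}')",
        # 3. Delete files that no table metadata references anymore.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


def run_maintenance(spark, catalog, table):
    """Execute each call through an existing SparkSession (assumed to be set up for Iceberg)."""
    for statement in spark_maintenance_calls(catalog, table):
        spark.sql(statement)


# Hypothetical catalog and table names, printed here instead of executed.
for statement in spark_maintenance_calls("glue_catalog", "mydb.events"):
    print(statement)
```

Snapshot expiration is not part of this list; that is what VACUUM does in Athena, and Spark has an expire_snapshots procedure for the same purpose.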
If you don't want to care about this stuff, there are also commercial vendors such as Tabular that will make sure that your Iceberg tables are in pristine condition.
Hope this helps! Let me know if you have any further questions.
Upvotes: 8