Sankar Krishnamoorthy
Sankar Krishnamoorthy

Reputation: 11

Delta Table and maintenance Strategy

I am not completely new to AWS. I need a design suggestion for my architecture .

I have to load say 50 files from different sources every month and these files are very small, less than 500 MB per file .

I am reading from S3 and loading to Delta table using Databricks and then exposing them through DB SQL.

  1. Do I really have to worry about partitioning in my delta table since its not a big file
  2. Is there any way to partition a delta table by size ?
  3. Not sure I understand completely how a vacuum/ optimize will run on a non partitioned delta table and so I am hoping that I will do optimize vacuum once monthly after my load.

does this sound correct ?

Please suggest from your experience/implementations

Sankar

Upvotes: 1

Views: 863

Answers (2)

Powers
Powers

Reputation: 19308

Do I really have to worry about partitioning in my delta table since its not a big file

Partitioning makes queries that filter on the partition key run faster. So if your data is partitioned by country, a query that has a WHERE statement like WHERE country='china' will be faster.

Partitioning exacerbates the small file problem, so you can just avoid it if your queries are running fast enough.

Is there any way to partition a delta table by size?

You partition Delta Tables by columns in the data, so they're not partitioned by size

Not sure I understand completely how a vacuum/ optimize will run on a non partitioned delta table and so I am hoping that I will do optimize vacuum once monthly after my load.

OPTIMIZE will combine small files into bigger files (referred to as compaction or bin-packing). Big data processing systems don't like lots of small files. OPTIMIZE will compact the small files in an unpartitioned lake or the small files within each partition of a partitioned lake.

OPTIMIZE is only needed if you have a bunch of small files that should be compacted. If you're creating lots of small files then OPTIMIZE is really important, but it's less important if the files aren't really tiny.

Upvotes: 0

Alex Ott
Alex Ott

Reputation: 87119

Partitioning usually is used when your queries will benefit from it - for example, you have something like, country column and then your users will query data by a specific country. Otherwise partitioning isn't strictly required, especially with Delta that has additional optimizations, like data skipping, bloom filters, OPTIMZE ZORDER BY, etc. (see for example this answer, or this and this).

In general I would recommend to perform OPTIMIZE ZORDER BY after you load your data - it will compact small files, and will colocate related data closer to each other. OPTIMIZE on unpartitioned table usually works well, as it's trying to avoid touching the files that are already optimized. Just don't forget to run VACUUM periodically to get rid of the not referenced files.

Upvotes: 2

Related Questions