Reputation: 11
I am not completely new to AWS. I need a design suggestion for my architecture .
I have to load say 50 files from different sources every month and these files are very small, less than 500 MB per file .
I am reading from S3 and loading to Delta table using Databricks and then exposing them through DB SQL.
does this sound correct ?
Please suggest from your experience/implementations
Sankar
Upvotes: 1
Views: 863
Reputation: 19308
Do I really have to worry about partitioning in my delta table since its not a big file
Partitioning makes queries that filter on the partition key run faster. So if your data is partitioned by country, a query that has a WHERE statement like WHERE country='china'
will be faster.
Partitioning exacerbates the small file problem, so you can just avoid it if your queries are running fast enough.
Is there any way to partition a delta table by size?
You partition Delta Tables by columns in the data, so they're not partitioned by size
Not sure I understand completely how a vacuum/ optimize will run on a non partitioned delta table and so I am hoping that I will do optimize vacuum once monthly after my load.
OPTIMIZE will combine small files into bigger files (referred to as compaction or bin-packing). Big data processing systems don't like lots of small files. OPTIMIZE will compact the small files in an unpartitioned lake or the small files within each partition of a partitioned lake.
OPTIMIZE is only needed if you have a bunch of small files that should be compacted. If you're creating lots of small files then OPTIMIZE is really important, but it's less important if the files aren't really tiny.
Upvotes: 0
Reputation: 87119
Partitioning usually is used when your queries will benefit from it - for example, you have something like, country column and then your users will query data by a specific country. Otherwise partitioning isn't strictly required, especially with Delta that has additional optimizations, like data skipping, bloom filters, OPTIMZE ZORDER BY, etc. (see for example this answer, or this and this).
In general I would recommend to perform OPTIMIZE ZORDER BY
after you load your data - it will compact small files, and will colocate related data closer to each other. OPTIMIZE
on unpartitioned table usually works well, as it's trying to avoid touching the files that are already optimized. Just don't forget to run VACUUM periodically to get rid of the not referenced files.
Upvotes: 2