Reputation: 2080
I'm looking for an efficient partitioning strategy for storing my dataframe in a Delta table.
My current dataframe has about 1.500.000 rows and it takes 3.5 h to move the data from the dataframe to the Delta table.
Looking for a more efficient way to do this write, I decided to try different columns of my table as partitioning columns. I checked the cardinality of my columns and selected the following ones (a sketch of the cardinality check is shown after the list).
column1 = 3 distinct values
column2 = 7 distinct values
column3 = 26 distinct values
column4 = 73 distinct values
column5 = 143 distinct values
column6 = 246 distinct values
column7 = 543 distinct values
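For reference, this is roughly how the cardinalities above can be computed (a simplified sketch; the dataframe is assumed to be named df, and the column names are the placeholders used above):

from pyspark.sql import functions as F

candidate_cols = ["column1", "column2", "column3", "column4", "column5", "column6", "column7"]
# count the distinct values of every candidate partition column in a single pass
cardinalities = df.select([F.countDistinct(c).alias(c) for c in candidate_cols]).collect()[0].asDict()
print(cardinalities)  # e.g. {'column1': 3, 'column2': 7, ...}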
Cluster: 64 GB, 8 cores.
I'm using the following code in my notebook:
df.write.partitionBy("column_1").format("delta").mode("overwrite").save(partition_1)
..
df.write.partitionBy("column_7").format("delta").mode("overwrite").save(partition7)
Thus, I wanted to see which partitioning strategy would give better results: a column with high cardinality, one with low cardinality, or one in between. To my surprise, this had no effect: all runs took practically the same time, with differences of only a few minutes, but every one of them over 3 h.
Why has this failed? Is there no advantage to partitioning?
Upvotes: 3
Views: 14479
Reputation: 87119
When you use Delta (either Databricks or OSS Delta 1.2.x, better 2.0), you often don't need to use partitioning at all, for reasons that don't apply to Parquet or other file formats.
The rules of thumb for partitioning Delta Lake tables are the following:
In all cases, don't use partitioning on high-cardinality columns (hundreds of values), and don't use too many partition columns, because in most cases this leads to the creation of small files that are less efficient to read (each file is accessed separately), and it increases the load on the driver, which needs to keep metadata for each file.
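For illustration, a minimal sketch of the two options that follow from this (paths are placeholders): either write the Delta table without partitioning at all, or partition only on a low-cardinality column such as column1:

# no partitioning - often sufficient for Delta tables
df.write.format("delta").mode("overwrite").save("/tmp/delta/my_table")

# partition only on a low-cardinality column (column1 has 3 distinct values)
df.write.partitionBy("column1").format("delta").mode("overwrite").save("/tmp/delta/my_table_partitioned")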
Upvotes: 7