Reputation: 2080
I'm looking for an efficient partitioning strategy for storing my dataframe in a Delta table.
My current dataframe has about 1.500.000 rows and it takes 3.5 h to move the data from the dataframe to the Delta table.
Looking for a more efficient way to do this write, I decided to try different columns of my table as partitioning columns. I checked the cardinality of my columns and selected the following ones (a sketch of the cardinality check is shown after the list).
column1 = 3 distinct values
column2 = 7 distinct values
column3 = 26 distinct values
column4 = 73 distinct values
column5 = 143 distinct values
column6 = 246 distinct values
column7 = 543 distinct values
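For reference, this is roughly how the cardinalities above can be computed (a simplified sketch; the dataframe is assumed to be named df, and the column names are the placeholders used above):

from pyspark.sql import functions as F

candidate_cols = ["column1", "column2", "column3", "column4", "column5", "column6", "column7"]
# count the distinct values of every candidate partition column in a single pass
cardinalities = df.select([F.countDistinct(c).alias(c) for c in candidate_cols]).collect()[0].asDict()
print(cardinalities)  # e.g. {'column1': 3, 'column2': 7, ...}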
Cluster: 64 GB, 8 cores.
I'm using the following code in my notebook:
df.write.partitionBy("column_1").format("delta").mode("overwrite").save(partition_1)
..
df.write.partitionBy("column_7").format("delta").mode("overwrite").save(partition7)
Thus, I wanted to see which partitioning strategy would give better results: a column with high cardinality, one with low cardinality, or one in between. To my surprise, this had no effect: all runs took practically the same time, with differences of only a few minutes, but every one of them over 3 h.
Why has this failed? Is there no advantage to partitioning?
Upvotes: 3
Views: 14479
Reputation: 87119
When you use Delta (either Databricks or OSS Delta 1.2.x, better 2.0), you often don't need to use partitioning at all, for reasons that don't apply to Parquet or other file formats.
The rules of thumb for partitioning Delta Lake tables are the following:
In all cases, don't use partitioning on high-cardinality columns (hundreds of values), and don't use too many partition columns, because in most cases this leads to the creation of small files that are less efficient to read (each file is accessed separately), and it increases the load on the driver, which needs to keep metadata for each file.
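For illustration, a minimal sketch of the two options that follow from this (paths are placeholders): either write the Delta table without partitioning at all, or partition only on a low-cardinality column such as column1:

# no partitioning - often sufficient for Delta tables
df.write.format("delta").mode("overwrite").save("/tmp/delta/my_table")

# partition only on a low-cardinality column (column1 has 3 distinct values)
df.write.partitionBy("column1").format("delta").mode("overwrite").save("/tmp/delta/my_table_partitioned")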
Upvotes: 7