E. Erfan
E. Erfan

Reputation: 1411

How to choose partition keys for apache iceberg tables

I have a number of hive warehouses. The data resides in parquet files in Amazon S3. Some of the tables contain TB of data. Currently in hive most tables are partitioned by a combination of month and year, both of which are saved mainly as string. Other fields are either bigint, int, float, double, string and unix timestamps. Our goal is to migrate them to apache iceberg tables. The challenge is how to choose the partition keys.

I have already calculated the cardinality of each field in each table by:

Select COUNT(DISTINCT my_column) As my_column_count
From my_table;

I have also calculated the percentage of null values for each field:

SELECT 100.0 * count(*)/number_of_all_records
FROM my_db.my_table
Where my_column IS NULL; 

In short I already know three things for each field:

  1. Data type
  2. Cardinality
  3. Percentage of null values

By knowing these three pieces of information, my question is how should I choose the best column or combination of columns as partition keys for my future iceberg tables? Are there any rule of thumbs?

How many partitions is considered as optimized when choosing partition keys? What data type is best when choosing partition keys? What are other factors that need to be considered? Is bucketing the same in iceberg tables as it is in hive and how it can be leveraged by the partition keys? What data types are best for partition keys? Is it better to have many small partitions or having a few big partitions? Any other aspects in partition keys that need to be considered?

Upvotes: 3

Views: 2069

Answers (1)

roizaig
roizaig

Reputation: 145

One crucial part is missing from your description - the queries. You need to understand what are the queries that will run on this data. Understanding the queries that will run on the data (to the best you can) is super important. For example, consider a simple table with: Date, Id, Name, Age as columns. If the queries are date based meaning, it will query the data in the context of dates,

select * from table where date > 'some-date'

then it's a good idea to partition by date.

However, if the queries are age related

 select * from table where age between 20 and 30

then you should consider partition by age or age groups

Upvotes: 1

Related Questions