Yura Khavyk
Yura Khavyk

Reputation: 11

Apache Iceberg Partitioning

Can someone help me understand the difference between 'hour' and 'hours' partitioning? If I partition by HOUR(created_dt), will it also work if I filter using MONTH(created_dt) in the WHERE clause, I mean if scan will do by partition, cause Iceberg does not let me to create redundant partitions.

select count(*) from bng.raw_game_actions where hours(played_at)=19;

returns nothing

select count(*) from bng.raw_game_actions where hour(played_at)=19;

return what expected

Upvotes: 1

Views: 636

Answers (1)

Greg Lindholm
Greg Lindholm

Reputation: 31

In the PARTITIONED BY clause, hours() is the 'Old syntax' and it supported for compatibility. hour() or date_hour() is the new syntax. These are partition transformations. See: https://iceberg.apache.org/docs/1.4.3/spark-ddl/#partitioned-by

In your SELECT statement hour() is a SQL function and not related to partition transformations. The SQL function will convert the timestamp into hour-of-day.

The hour() partition transformation will transform a timestamp into the number of hours since 1970-01-01. It does not mean hour-of-day.

According to the description of Iceberg's hidden partitioning, if you partition by HOUR(ts) it can use that partition for other queries against the same timestamp field. https://iceberg.apache.org/docs/1.4.3/partitioning/#icebergs-hidden-partitioning

Upvotes: 0

Related Questions