Reputation: 11
Can someone help me understand the difference between 'hour' and 'hours' partitioning? If I partition by HOUR(created_dt), will it also work if I filter using MONTH(created_dt) in the WHERE clause, I mean if scan will do by partition, cause Iceberg does not let me to create redundant partitions.
select count(*) from bng.raw_game_actions where hours(played_at)=19;
returns nothing
select count(*) from bng.raw_game_actions where hour(played_at)=19;
return what expected
Upvotes: 1
Views: 636
Reputation: 31
In the PARTITIONED BY clause, hours()
is the 'Old syntax' and it supported for compatibility. hour()
or date_hour()
is the new syntax. These are partition transformations.
See: https://iceberg.apache.org/docs/1.4.3/spark-ddl/#partitioned-by
In your SELECT statement hour()
is a SQL function and not related to partition transformations. The SQL function will convert the timestamp into hour-of-day.
The hour()
partition transformation will transform a timestamp into the number of hours since 1970-01-01. It does not mean hour-of-day.
According to the description of Iceberg's hidden partitioning, if you partition by HOUR(ts) it can use that partition for other queries against the same timestamp field. https://iceberg.apache.org/docs/1.4.3/partitioning/#icebergs-hidden-partitioning
Upvotes: 0