Reputation: 26149
Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).
What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?
Upvotes: 7
Views: 3838
Reputation: 3973
Its not a new column, but its a pseudo-column, You should re-create your table with adding the partitioning specification like this :
create table table_name (
id int,
name string,
timestamp string
)
partitioned by (date string)
Then you load the data creating the partitions dynamically like this
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION(substring(timestamp,0,10))
SELECT tno.id, tno.name, tno.timestamp;
Now if you select all from your table you will see a new column for the partition, but consider that a Hive partition is just a subdirectory and its not a real column, hence it does not affect the total table size only by some kilobytes.
Upvotes: 6
Reputation: 5834
As partition is also one of the column in hive, every partition has value(assign using static or dynamic partition) and every partition is mapped to directory in HDFS, so it has to be additional column.
You may choose one the below option:
Let's say table DDL:
CREATE TABLE temp( id string) PARTITIONED BY (day int)
ALTER TABLE xyz
ADD PARTITION (day=00)
location '/2017/02/02';
or
INSERT OVERWRITE TABLE xyz
PARTITION (day=1)
SELECT id FROM temp
WHERE dayOfTheYear(**timestamp**)=1;
INSERT INTO TABLE xyz
PARTITION (day)
SELECT id ,
dayOfTheYear(day)
FROM temp;
Hive doesn't have any dayOfTheYear
function you create it.
Upvotes: 0