Jared
Jared

Reputation: 26149

How can I partition a hive table by (only) a portion of a timestamp column?

Assume I have a Hive table that includes a TIMESTAMP column that is frequently (almost always) included in the WHERE clauses of a query. It makes sense to partition this table by the TIMESTAMP field; however, to keep to a reasonable cardinality, it makes sense to partition by day (not by the maximum resolution of the TIMESTAMP).

What's the best way to achieve this? Should I create an additional column (DATE) and partition on that? Or is there a way to achieve the partition without creating a duplicate column?

Upvotes: 7

Views: 3838

Answers (2)

54l3d
54l3d

Reputation: 3973

Its not a new column, but its a pseudo-column, You should re-create your table with adding the partitioning specification like this :

create table table_name (
  id                int,
  name              string,
  timestamp         string
)
partitioned by (date string)

Then you load the data creating the partitions dynamically like this

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
FROM table_name_old tno
INSERT OVERWRITE TABLE table_name PARTITION(substring(timestamp,0,10))
       SELECT tno.id, tno.name, tno.timestamp;

Now if you select all from your table you will see a new column for the partition, but consider that a Hive partition is just a subdirectory and its not a real column, hence it does not affect the total table size only by some kilobytes.

Upvotes: 6

Rahul Sharma
Rahul Sharma

Reputation: 5834

As partition is also one of the column in hive, every partition has value(assign using static or dynamic partition) and every partition is mapped to directory in HDFS, so it has to be additional column.


You may choose one the below option:

Let's say table DDL:

CREATE TABLE temp( id string) PARTITIONED BY (day int)

  1. If the data is organised day wise then add static partition:

ALTER TABLE xyz ADD PARTITION (day=00) location '/2017/02/02';

or

INSERT OVERWRITE TABLE xyz PARTITION (day=1) SELECT id FROM temp  WHERE dayOfTheYear(**timestamp**)=1;

  1. Generate day number using dynamic partition:

INSERT INTO TABLE xyz PARTITION (day) SELECT id , dayOfTheYear(day) FROM temp;

Hive doesn't have any dayOfTheYear function you create it.

Upvotes: 0

Related Questions