update partitioned hive table if we already have some data in it

Question

I have a hive table which is partitioned by day and the data related to 03-02-2017 is loaded into it but, on next day I have data related to 03-03-2017. Now how to update hive table with my new data so that my hive warehouse directory should look like this

hive/warehouse/sample_database/sample_table/day=03-02-2017/data_part_0000
hive/warehouse/sample_database/sample_table/day=03-03-2017/data_part_0000

So please provide me the code for creating table as well as how to update the table when the new data sets added to this.

David דודו Markovitz · Accepted Answer

Here is a full demo that describes 2 basic options:

Add partitions to the table. The HDFS directories will be created automatically.
Add directories to the HDFS and apply msck repair on the table to add the partitions.

Note that I'm using the DATE type for the partitions.
The ANSI/ISO notation for date literal is date 'YYYY-MM-DD'

Even for older versions that did not support the DATE type I would strongly recommend to avoid any date format other than YYYY-MM-DD, for 2 reasons:
1. This is the only format supported by the date functions.
2. This format allows correct alphabetical comparison, e.g. -
'2017-01-22' > '2016-09-22' but '01-22-2017' < '09-22-2016'

Demo

bash

hdfs dfs -mkdir -p /hive/warehouse/sample_database/sample_table

hive

create external table sample_table
(
    i int
)
partitioned by (day date)
location '/hive/warehouse/sample_database/sample_table'
;

Option 1 - alter table ... add partition ...

hive

alter table sample_table add partition (day=date '2017-03-02');
alter table sample_table add partition (day=date '2017-03-03');

hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.067 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive>

Option 2 - hdfs dfs -mkdir ... + msck repair table ...

bash

hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-02
hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-03

hive

msck repair

hive> show partitions sample_table;
OK
Time taken: 0.187 seconds
hive> msck repair table sample_table;
OK
Partitions not in metastore:    sample_table:day=2017-03-02 sample_table:day=2017-03-03
Repair: Added partition to metastore sample_table:day=2017-03-02
Repair: Added partition to metastore sample_table:day=2017-03-03
Time taken: 0.143 seconds, Fetched: 3 row(s)
hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.076 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive>

update partitioned hive table if we already have some data in it

Answers (2)

Demo

Option 1 - alter table ... add partition ...

Option 2 - hdfs dfs -mkdir ... + msck repair table ...

Related Questions