Vizram
Vizram

Reputation: 25

update partitioned hive table if we already have some data in it

I have a hive table which is partitioned by day and the data related to 03-02-2017 is loaded into it but, on next day I have data related to 03-03-2017. Now how to update hive table with my new data so that my hive warehouse directory should look like this

hive/warehouse/sample_database/sample_table/day=03-02-2017/data_part_0000
hive/warehouse/sample_database/sample_table/day=03-03-2017/data_part_0000

So please provide me the code for creating table as well as how to update the table when the new data sets added to this.

Upvotes: 1

Views: 3290

Answers (2)

David דודו Markovitz
David דודו Markovitz

Reputation: 44991

Here is a full demo that describes 2 basic options:

  1. Add partitions to the table. The HDFS directories will be created automatically.
  2. Add directories to the HDFS and apply msck repair on the table to add the partitions.

Note that I'm using the DATE type for the partitions.
The ANSI/ISO notation for date literal is date 'YYYY-MM-DD'

Even for older versions that did not support the DATE type I would strongly recommend to avoid any date format other than YYYY-MM-DD, for 2 reasons:
1. This is the only format supported by the date functions.
2. This format allows correct alphabetical comparison, e.g. -
'2017-01-22' > '2016-09-22' but '01-22-2017' < '09-22-2016'

Demo

bash

hdfs dfs -mkdir -p /hive/warehouse/sample_database/sample_table

hive

create external table sample_table
(
    i int
)
partitioned by (day date)
location '/hive/warehouse/sample_database/sample_table'
;

Option 1 - alter table ... add partition ...

hive

alter table sample_table add partition (day=date '2017-03-02');
alter table sample_table add partition (day=date '2017-03-03');

hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.067 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive> 

Option 2 - hdfs dfs -mkdir ... + msck repair table ...

bash

hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-02
hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-03

hive

msck repair

hive> show partitions sample_table;
OK
Time taken: 0.187 seconds
hive> msck repair table sample_table;
OK
Partitions not in metastore:    sample_table:day=2017-03-02 sample_table:day=2017-03-03
Repair: Added partition to metastore sample_table:day=2017-03-02
Repair: Added partition to metastore sample_table:day=2017-03-03
Time taken: 0.143 seconds, Fetched: 3 row(s)
hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.076 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive> 

Upvotes: 1

dbustosp
dbustosp

Reputation: 4478

First of all, remember that Hive tables point to somewhere in the HDFS, so that I am not sure what you mean with Hive data warehouse.

For creating the table, you should have to create a partitioned table by day (String), something like:

CREATE EXTERNAL TABLE myTable
partitioned by (day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='/path/to/my/avro/schema/avro_schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/hive/warehouse/sample_database/sample_table';

In the above solution I am assuming you are treating with avro data. The important part actually is: partitioned by (day STRING). Here you are telling Hive that this table will be partitioned by day.

When a new dataset appears in the structure of the directory (hive/warehouse/sample_database/sample_table/day=03-03-2017/data_part_0000) and it is not added by an insert hive statement, you will have to run this command:

msck repair table myTable;

This way, a new partition will be added to the metastore of the table. Note that, you have another option:

alter table myTable add partition (day = '03-03-2017');

Upvotes: 0

Related Questions