Reputation: 25
I have a hive table which is partitioned by day and the data related to 03-02-2017 is loaded into it but, on next day I have data related to 03-03-2017. Now how to update hive table with my new data so that my hive warehouse directory should look like this
hive/warehouse/sample_database/sample_table/day=03-02-2017/data_part_0000
hive/warehouse/sample_database/sample_table/day=03-03-2017/data_part_0000
So please provide me the code for creating table as well as how to update the table when the new data sets added to this.
Upvotes: 1
Views: 3290
Reputation: 44991
Here is a full demo that describes 2 basic options:
Note that I'm using the DATE type for the partitions.
The ANSI/ISO notation for date literal is date 'YYYY-MM-DD'
Even for older versions that did not support the DATE type I would strongly recommend to avoid any date format other than YYYY-MM-DD, for 2 reasons:
1. This is the only format supported by the date functions.
2. This format allows correct alphabetical comparison, e.g. -
'2017-01-22' > '2016-09-22' but '01-22-2017' < '09-22-2016'
bash
hdfs dfs -mkdir -p /hive/warehouse/sample_database/sample_table
hive
create external table sample_table
(
i int
)
partitioned by (day date)
location '/hive/warehouse/sample_database/sample_table'
;
hive
alter table sample_table add partition (day=date '2017-03-02');
alter table sample_table add partition (day=date '2017-03-03');
hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.067 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive>
bash
hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-02
hdfs dfs -mkdir /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive
msck repair
hive> show partitions sample_table;
OK
Time taken: 0.187 seconds
hive> msck repair table sample_table;
OK
Partitions not in metastore: sample_table:day=2017-03-02 sample_table:day=2017-03-03
Repair: Added partition to metastore sample_table:day=2017-03-02
Repair: Added partition to metastore sample_table:day=2017-03-03
Time taken: 0.143 seconds, Fetched: 3 row(s)
hive> show partitions sample_table;
OK
day=2017-03-02
day=2017-03-03
Time taken: 0.076 seconds, Fetched: 2 row(s)
hive> dfs -ls /hive/warehouse/sample_database/sample_table;
Found 2 items
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-02
... 2017-03-04 23:31 /hive/warehouse/sample_database/sample_table/day=2017-03-03
hive>
Upvotes: 1
Reputation: 4478
First of all, remember that Hive tables point to somewhere in the HDFS, so that I am not sure what you mean with Hive data warehouse.
For creating the table, you should have to create a partitioned table by day (String), something like:
CREATE EXTERNAL TABLE myTable
partitioned by (day STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.url'='/path/to/my/avro/schema/avro_schema.avsc')
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/hive/warehouse/sample_database/sample_table';
In the above solution I am assuming you are treating with avro data. The important part actually is: partitioned by (day STRING)
. Here you are telling Hive that this table will be partitioned by day.
When a new dataset appears in the structure of the directory (hive/warehouse/sample_database/sample_table/day=03-03-2017/data_part_0000
) and it is not added by an insert hive statement
, you will have to run this command:
msck repair table myTable;
This way, a new partition will be added to the metastore of the table. Note that, you have another option:
alter table myTable add partition (day = '03-03-2017');
Upvotes: 0