Reputation: 792
How to insert overwrite partitions only if partitions not exists in HIVE?
Just as title. I'm working on something that always needs to rewrite hive tables. I have tables that has multiple partitions and I only want to insert new partitions without change exist partitions when I rerun the code after change.
Upvotes: 3
Views: 4028
Reputation: 8012
One option is to join (left join on partition columns as keys) the source data set with distinct partition columns from the target table and filter out the partitions which are in common. You know what I mean; your Hive query should looks like this:
insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
src.column1,
src.column2,
....,
src.columnN,
src.partition_column1,
src.partition_column2,
....,
src.partition_columnN
from
source src
left join
(
select distinct
partition_column1,
partition_column2,
....,
partition_columnN
from
target
)
tgt
on src.partition_column1 = tgt.partition_column1
and src.partition_column1 = tgt.partition_column1
...
src.partition_columnN = tgt.partition_columnN
where
tgt.partition_column1 is null
or tgt.partition_column2 is null
...
tgt.partition_columnN is null;
A simple demonstration of this logic is given below:
Let's create two tables named orders, and orders_source. The order table is going to be the target table and orders_source is the source table. For simplicity I'm using the similar schema for both the tables.
CREATE TABLE `orders`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
CREATE TABLE `orders_source`(
`id` int,
`customer_id` int,
`shipper_id` int)
PARTITIONED BY (
`state` string,
`order_date` date)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
'orc.bloom.filter.columns'='id,customer_id',
'orc.compress'='SNAPPY',
'orc.compress.size'='262144',
'orc.create.index'='true',
'orc.row.index.stride'='3000',
'orc.stripe.size'='268435456');
Next, insert some sample records for testing the logic:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
insert overwrite table orders partition (state, order_date)
select
orde.id,
orde.customer_id,
orde.shipper_id,
orde.state,
orde.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10241 as id,
20481 as customer_id,
30721 as shipper_id,
'GA' as state,
'2019-09-01' as order_date
)
orde;
insert overwrite table orders_source partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
(
select
10240 as id,
20480 as customer_id,
30720 as shipper_id,
'CA' as state,
'2019-09-01' as order_date
union all
select
10242 as id,
20482 as customer_id,
30722 as shipper_id,
'CA' as state,
'2019-09-02' as order_date
union all
select
10243 as id,
20483 as customer_id,
30723 as shipper_id,
'FL' as state,
'2019-09-02' as order_date
union all
select
10244 as id,
20484 as customer_id,
30724 as shipper_id,
'TX' as state,
'2019-09-02' as order_date
)
orso;
Now, let's check the data we have inserted in both the tables before running our actual business logic:
hive (default)> select * from orders_source;
OK
orders_source.id orders_source.customer_id orders_source.shipper_id orders_source.state orders_source.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10241 20481 30721 GA 2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)
Next, execute our logic to select the records from the source table and insert into the target table. You can run the below query:
hive (default)> select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
OK
orso.id orso.customer_id orso.shipper_id orso.state orso.order_date
10243 20483 30723 FL 2019-09-02
10244 20484 30724 TX 2019-09-02
10242 20482 30722 CA 2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)
You can see the above results.
Finally insert the records into target table by issuing the below query:
insert overwrite table orders partition (state, order_date)
select
orso.id,
orso.customer_id,
orso.shipper_id,
orso.state,
orso.order_date
from
orders_source orso
left join
(
select distinct
state,
order_date
from
orders
)
orde
on orso.state = orde.state
and orso.order_date = orde.order_date
where
orde.state is null
or orde.order_date is null;
Now, let's verify the data in target table after the insert operation.
hive (default)> select * from orders;
OK
orders.id orders.customer_id orders.shipper_id orders.state orders.order_date
10240 20480 30720 CA 2019-09-01
10242 20482 30722 CA 2019-09-02
10243 20483 30723 FL 2019-09-02
10241 20481 30721 GA 2019-09-01
10244 20484 30724 TX 2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)
That's it. You're all set!
Upvotes: 1
Reputation: 38290
You can join with existing partition list and add where it is NULL condition (not joined only). Also you can use NOT EXISTS (it will generate the same plan as left join in Hive) Like this:
insert overwrite table target_table partition (partition_key)
select col1, ... coln, s.partition_key
from source s
left join (select distinct partition_key --existing partitions
from target_table
) t on s.partition_key=t.partition_key
where t.partition_key is NULL; --no partitions exists in the target
Upvotes: 2