Yang
Yang

Reputation: 792

How to insert overwrite partitions only if partitions not exists in HIVE?

How to insert overwrite partitions only if partitions not exists in HIVE?

Just as title. I'm working on something that always needs to rewrite hive tables. I have tables that has multiple partitions and I only want to insert new partitions without change exist partitions when I rerun the code after change.

Upvotes: 3

Views: 4028

Answers (2)

1218985
1218985

Reputation: 8012

One option is to join (left join on partition columns as keys) the source data set with distinct partition columns from the target table and filter out the partitions which are in common. You know what I mean; your Hive query should looks like this:

insert overwrite table target_table partition (partition_column1, partition_column2, ..., partition_columnN)
select
   src.column1,
   src.column2,
   ....,
   src.columnN,
   src.partition_column1,
   src.partition_column2,
   ....,
   src.partition_columnN
from
   source src 
   left join
      (
         select distinct
            partition_column1,
            partition_column2,
            ....,
            partition_columnN
         from
            target
      )
      tgt 
      on src.partition_column1 = tgt.partition_column1 
      and src.partition_column1 = tgt.partition_column1
      ...
      src.partition_columnN = tgt.partition_columnN 
where
   tgt.partition_column1 is null 
   or tgt.partition_column2 is null
   ...
   tgt.partition_columnN is null;

A simple demonstration of this logic is given below:

Let's create two tables named orders, and orders_source. The order table is going to be the target table and orders_source is the source table. For simplicity I'm using the similar schema for both the tables.

CREATE TABLE `orders`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

CREATE TABLE `orders_source`(
  `id` int, 
  `customer_id` int, 
  `shipper_id` int)
PARTITIONED BY ( 
  `state` string,
  `order_date` date)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.orc.OrcSerde' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
TBLPROPERTIES (
  'orc.bloom.filter.columns'='id,customer_id', 
  'orc.compress'='SNAPPY', 
  'orc.compress.size'='262144', 
  'orc.create.index'='true', 
  'orc.row.index.stride'='3000', 
  'orc.stripe.size'='268435456');

Next, insert some sample records for testing the logic:

set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;

insert overwrite table orders partition (state, order_date) 
select
   orde.id,
   orde.customer_id,
   orde.shipper_id,
   orde.state,
   orde.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10241 as id,
         20481 as customer_id,
         30721 as shipper_id,
         'GA' as state,
         '2019-09-01' as order_date
   )
   orde;

insert overwrite table orders_source partition (state, order_date) 
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   (
      select
         10240 as id,
         20480 as customer_id,
         30720 as shipper_id,
         'CA' as state,
         '2019-09-01' as order_date 
      union all
      select
         10242 as id,
         20482 as customer_id,
         30722 as shipper_id,
         'CA' as state,
         '2019-09-02' as order_date 
      union all
      select
         10243 as id,
         20483 as customer_id,
         30723 as shipper_id,
         'FL' as state,
         '2019-09-02' as order_date 
      union all
      select
         10244 as id,
         20484 as customer_id,
         30724 as shipper_id,
         'TX' as state,
         '2019-09-02' as order_date
   )
   orso;

Now, let's check the data we have inserted in both the tables before running our actual business logic:

hive (default)> select * from orders_source;
OK
orders_source.id    orders_source.customer_id   orders_source.shipper_id    orders_source.state orders_source.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
Time taken: 0.085 seconds, Fetched: 4 row(s)

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10241   20481   30721   GA  2019-09-01
Time taken: 0.073 seconds, Fetched: 2 row(s)

Next, execute our logic to select the records from the source table and insert into the target table. You can run the below query:

hive (default)> select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;
OK
orso.id orso.customer_id    orso.shipper_id orso.state  orso.order_date
10243   20483   30723   FL  2019-09-02
10244   20484   30724   TX  2019-09-02
10242   20482   30722   CA  2019-09-02
Time taken: 11.113 seconds, Fetched: 3 row(s)

You can see the above results.

Finally insert the records into target table by issuing the below query:

insert overwrite table orders partition (state, order_date)
select
   orso.id,
   orso.customer_id,
   orso.shipper_id,
   orso.state,
   orso.order_date 
from
   orders_source orso 
   left join
      (
         select distinct
            state,
            order_date 
         from
            orders
      )
      orde 
      on orso.state = orde.state 
      and orso.order_date = orde.order_date 
where
   orde.state is null 
   or orde.order_date is null;

Now, let's verify the data in target table after the insert operation.

hive (default)> select * from orders;
OK
orders.id   orders.customer_id  orders.shipper_id   orders.state    orders.order_date
10240   20480   30720   CA  2019-09-01
10242   20482   30722   CA  2019-09-02
10243   20483   30723   FL  2019-09-02
10241   20481   30721   GA  2019-09-01
10244   20484   30724   TX  2019-09-02
Time taken: 0.074 seconds, Fetched: 5 row(s)

That's it. You're all set!

Upvotes: 1

leftjoin
leftjoin

Reputation: 38290

You can join with existing partition list and add where it is NULL condition (not joined only). Also you can use NOT EXISTS (it will generate the same plan as left join in Hive) Like this:

   insert overwrite table target_table partition (partition_key)
    select col1, ... coln, s.partition_key
      from source s 
           left join (select distinct partition_key --existing partitions
                       from target_table
                     ) t on s.partition_key=t.partition_key
     where t.partition_key is NULL; --no partitions exists in the target

Upvotes: 2

Related Questions