Reputation: 763
I have two tables in Hive, stg and history. stg is basically a snapshot table that is overwritten every day. Its data is inserted into the history table every day, in a new partition.
Day 1
stg table
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+
history table (partitioned by the load_date column)
+-----+------------+------------+------------+
| pk | from_d | to_d |load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
Problem statement:
1) If I receive a PK that is already present in the history table, I need to update the to_d column for that PK in history.
2) The new to_d value should be the from_d value of the matching row in the stg table minus 1 day.
3) If the same PK arrives again on a later day, only the latest record in history for that PK should be updated, not all of its records.
Please check PK 111 in the data examples below.
Day 2
stg
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+
The history table should be updated as below
+-----+------------+------------+------------+
| pk | from_d | to_d | load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-01-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
To achieve the above, I first updated the history table using
insert overwrite table history partition(load_date)
select pk, from_d,
case when pk = '111' then '2019-02-01' when pk = '333' then '2019-02-01' else to_d end as to_d,
load_date
from history;
Once this was done, I inserted the day 2 stg table into the history table.
Day 3
stg
+-----+------------+------------+
| pk | from_d | to_d |
+-----+------------+------------+
| 111 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
| 222 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
| 555 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+
The history table should be updated as below
+-----+------------+------------+------------+
| pk | from_d | to_d | load_date |
+-----+------------+------------+------------+
| 111 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 222 | 2019-01-01 | 2019-03-02 | 2019-01-01 |
+-----+------------+------------+------------+
| 333 | 2019-01-01 | 2019-02-01 | 2019-01-01 |
+-----+------------+------------+------------+
| 111 | 2019-02-02 | 2019-03-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 333 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 444 | 2019-02-02 | 2019-02-02 | 2019-02-02 |
+-----+------------+------------+------------+
| 111 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 222 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
| 555 | 2019-03-03 | 2019-03-03 | 2019-03-03 |
+-----+------------+------------+------------+
To achieve the above I ran
insert overwrite table history partition(load_date)
select pk, from_d,
case when pk = '111' then '2019-03-02' else to_d end as to_d,
load_date
from history
where load_date = '2019-02-02';
insert overwrite table history partition(load_date)
select pk, from_d,
case when pk = '222' then '2019-03-02' else to_d end as to_d,
load_date
from history
where load_date = '2019-01-01';
Then I inserted the stg table data.
I am achieving what I want, but this is a tedious process and there must be a better approach.
Note: I don't want to use UPDATE statements for this problem. INSERT OVERWRITE is what I am looking for.
Upvotes: 0
Views: 519
Reputation: 49270
You could do this in 2 steps, which might be better in terms of performance.
1. Create a temporary table with a load_date ranking for each pk. This table can be overwritten every time the job/script runs.
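-- drop first so the ranking is rebuilt on every run
drop table if exists rank_load_date_pk;
create table rank_load_date_pk as
-- load_date is carried along because step 2 selects it; rnum = 1 is the latest row per pk
select pk,from_d,to_d,load_date,row_number() over(partition by pk order by load_date desc) as rnum
from history
;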
2. There are 3 scenarios to handle next:
- pks existing in both stg and history. In this case, the latest history row should be selected, with to_d recalculated.
- Remaining rows from history. In this case, select all the non-latest rows for each pk.
- All rows from stg.
SQL:
insert overwrite table history partition(load_date)
--latest history row per pk; to_d is recalculated only when the pk is also in stg
--the row stays in its original partition, and to_d becomes the stg from_d minus 1 day
select r.pk,r.from_d,coalesce(date_sub(s.from_d,1),r.to_d) as to_d,r.load_date
from rank_load_date_pk r
left join stg s on s.pk = r.pk
where r.rnum = 1
union all
--remaining rows
select pk,from_d,to_d,load_date
from rank_load_date_pk
where rnum > 1
union all
--stg all rows
select pk,from_d,to_d,to_d as load_date
from stg
;
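The insert overwrite above names the partition column without a value (partition(load_date)), which makes it a dynamic partition insert. Depending on the Hive version and cluster defaults, this may be rejected unless dynamic partitioning is enabled in nonstrict mode. A minimal sketch of the session settings, assuming they are not already set globally:

```sql
-- Allow dynamic partition inserts in this session.
set hive.exec.dynamic.partition=true;
-- nonstrict mode permits an insert where every partition column is dynamic.
set hive.exec.dynamic.partition.mode=nonstrict;
```

Only the partitions that receive rows from the select are overwritten, so any load_date partition that the query does not produce is left untouched.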
Upvotes: 0
Reputation: 5480
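You can do it as below.
First create a table and assign a row_number to each row, partitioned by pk. The source needs to be the table that holds every day's rows along with load_date (history, with each day's stg rows appended as-is), since stg itself has no load_date column:
create table stg_row_num as select *,
row_number() over ( partition by pk order by load_date desc) as row_num from history;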
The above query should give you a table like below
+---+----------+----------+----------+--------+
| pk| from_d| to_d| load_date| row_num|
+---+----------+----------+----------+--------+
|111|2019-03-03|2019-03-03|2019-03-03| 1|
|111|2019-02-02|2019-02-02|2019-02-02| 2|
|111|2019-01-01|2019-01-01|2019-01-01| 3|
|222|2019-03-03|2019-03-03|2019-03-03| 1|
|222|2019-01-01|2019-01-01|2019-01-01| 2|
|333|2019-02-02|2019-02-02|2019-02-02| 1|
|333|2019-01-01|2019-01-01|2019-01-01| 2|
|444|2019-02-02|2019-02-02|2019-02-02| 1|
|555|2019-03-03|2019-03-03|2019-03-03| 1|
+---+----------+----------+----------+--------+
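Once you have the above table, use the LAG function as below:
select pk, from_d,
-- ordered by load_date desc, LAG returns the next newer row for the same pk
case when row_num = 1 then to_d else date_sub(lag(from_d) over (partition by pk order by load_date desc), 1) end as to_d,
row_num from stg_row_num;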
This will give you the desired result
+---+----------+----------+-------------------+
| pk| from_d| to_d|row_number_window_0|
+---+----------+----------+-------------------+
|111|2019-03-03|2019-03-03| 1|
|111|2019-02-02|2019-03-02| 2|
|111|2019-01-01|2019-02-01| 3|
|222|2019-03-03|2019-03-03| 1|
|222|2019-01-01|2019-03-02| 2|
|333|2019-02-02|2019-02-02| 1|
|333|2019-01-01|2019-02-01| 2|
|444|2019-02-02|2019-02-02| 1|
|555|2019-03-03|2019-03-03| 1|
+---+----------+----------+-------------------+
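The select above only displays the result. As a rough sketch of how the same window idea could be written back with a single insert overwrite and no intermediate table, the statement below uses LEAD over ascending load_date, which is the mirror image of LAG over descending load_date. It rests on several assumptions that are not in the question: the date columns are strings in yyyy-MM-dd format, the load_date for new stg rows equals from_d (as in the sample data), today's stg rows are not yet in history, and dynamic partition inserts are enabled.

```sql
-- Sketch only: rewrites every partition of history on each run.
insert overwrite table history partition (load_date)
select pk,
       from_d,
       -- If a newer row exists for this pk, close this row one day before
       -- the newer row's from_d; otherwise keep the existing to_d.
       coalesce(
         cast(date_sub(lead(from_d) over (partition by pk order by load_date), 1) as string),
         to_d
       ) as to_d,
       load_date
from (
  select pk, from_d, to_d, load_date from history
  union all
  -- Assumption: new rows go into the partition named after their from_d.
  select pk, from_d, to_d, from_d as load_date from stg
) u;
```

Rows that were already closed on an earlier day get the same to_d recomputed, so the statement can run daily without special cases, at the cost of rewriting the whole table each time. Running it twice for the same stg snapshot would duplicate the stg rows, so it should run once per load.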
Hope this helps
Upvotes: 1