I have a table in a PSA where I am capturing changes to records in a source table. Let's say it looks like this:
+-----+------------+----------------+------------------+------------------+
| PK | Check_cols | Dont_care_cols | start_ts | end_ts |
+-----+------------+----------------+------------------+------------------+
| 123 | abc | def | 1/1/20 00:10:00 | 1/2/20 13:13:23 |
| 123 | abc | dhf | 1/2/20 13:13:23 | 1/3/20 04:21:00 |
| 123 | abc | dhz | 1/3/20 04:21:00 | 1/5/20 12:15:00 |
| 123 | abd | dyz | 1/5/20 12:15:00 | 1/9/20 15:16:00 |
| 123 | abc | dyz | 1/9/20 15:16:00 | null |
| 456 | ghi | jkl | 1/2/20 03:45:00 | 1/10/20 00:00:00 |
| 456 | lmn | opq | 1/10/20 00:00:00 | null |
+-----+------------+----------------+------------------+------------------+
I would like to build a type-2 dimension (one that tracks changes with record start and stop times) from that table using only the values of check_cols, like the one shown below. I am looking for a pure SQL solution, with no looping.
check_cols actually comprises multiple columns, but I will use an MD5 hash of them to look for changes. Since my dimension only cares about check_cols, there are situations where the timestamps aren't what I need: for instance, when a value in dont_care_cols changes but none of the check_cols values do.
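(For readers wondering what "use an MD5 hash to look for changes" means in practice: the idea is to reduce the multiple tracked columns to one comparable value. A minimal sketch, outside SQL and with hypothetical column values; the `|` delimiter and NULL handling are assumptions, and in Snowflake the equivalent would be something along the lines of `MD5(col1 || '|' || col2)`:)

```python
import hashlib

def check_hash(*cols):
    """Collapse the change-tracked columns into one comparable value.
    The '|' delimiter is an arbitrary choice that keeps ('ab','c') from
    colliding with ('a','bc'); NULLs get an empty-string sentinel."""
    joined = "|".join("" if c is None else str(c) for c in cols)
    return hashlib.md5(joined.encode()).hexdigest()

# Two rows count as "changed" only when a tracked column differs.
print(check_hash("abc", 42) == check_hash("abc", 42))  # identical check cols
print(check_hash("abc", 42) == check_hash("abd", 42))  # one value changed
```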
From the data above, I want the following result set:
+-----+------------+------------------+------------------+
| PK | Check_cols | start_ts | end_ts |
+-----+------------+------------------+------------------+
| 123 | abc | 1/1/20 00:10:00 | 1/5/20 12:15:00 |
| 123 | abd | 1/5/20 12:15:00 | 1/9/20 15:16:00 |
| 123 | abc | 1/9/20 15:16:00 | null |
| 456 | ghi | 1/2/20 03:45:00 | 1/10/20 00:00:00 |
| 456 | lmn | 1/10/20 00:00:00 | null |
+-----+------------+------------------+------------------+
I've tried using window functions to compare lead and lag values, take mins and maxes, etc., but I can't handle the edge case shown for PK 123 in the first table, where check_cols reverts to an earlier value after an intervening change. I also have not found a solution via Google/Stack Overflow/etc. Most methods rely on daily snapshots running. I want to be able to rebuild the target table if I have a logic change. Anyone have thoughts?
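(To restate the target semantics outside SQL: merge runs of *consecutive* rows per PK whose check_cols match, keeping the run's first start_ts and last end_ts. A minimal Python sketch of that collapse, using the sample data above with timestamps abbreviated as strings and rows assumed pre-sorted by PK and start_ts; this is an illustration, not the asker's code:)

```python
from itertools import groupby

# (pk, check_cols, start_ts, end_ts); dont_care_cols omitted.
rows = [
    (123, "abc", "1/1/20 00:10", "1/2/20 13:13"),
    (123, "abc", "1/2/20 13:13", "1/3/20 04:21"),
    (123, "abc", "1/3/20 04:21", "1/5/20 12:15"),
    (123, "abd", "1/5/20 12:15", "1/9/20 15:16"),
    (123, "abc", "1/9/20 15:16", None),
    (456, "ghi", "1/2/20 03:45", "1/10/20 00:00"),
    (456, "lmn", "1/10/20 00:00", None),
]

def collapse(rows):
    """Merge adjacent rows with the same (pk, check_cols) into one interval
    spanning the first start_ts to the last end_ts of the run."""
    out = []
    # groupby only merges *adjacent* rows, so 'abc' reappearing after 'abd'
    # correctly opens a new interval -- the edge case for PK 123.
    for (pk, cc), run in groupby(rows, key=lambda r: (r[0], r[1])):
        run = list(run)
        out.append((pk, cc, run[0][2], run[-1][3]))
    return out

for r in collapse(rows):
    print(r)
```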
Upvotes: 1
I don't know if this is the best answer or whether it covers all of your use cases, but give it a try and let me know if there is an edge case that trips it up. It's a bit of a hack. I also added a few records to the test data:
CREATE OR REPLACE TEMP TABLE tran_data (pk int, check_cols varchar, dont_care_cols varchar, start_ts timestamp, end_ts timestamp);
INSERT INTO tran_data
SELECT *
FROM (VALUES
    (123,'abc','def',TO_TIMESTAMP('1/1/20 00:10:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/2/20 13:13:23','MM/DD/YY hh:mi:ss')),
    (123,'abc','dhf',TO_TIMESTAMP('1/2/20 13:13:23','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/3/20 04:21:00','MM/DD/YY hh:mi:ss')),
    (123,'abc','dhz',TO_TIMESTAMP('1/3/20 04:21:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/5/20 12:15:00','MM/DD/YY hh:mi:ss')),
    (123,'abd','dyz',TO_TIMESTAMP('1/5/20 12:15:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/9/20 15:16:00','MM/DD/YY hh:mi:ss')),
    (123,'abd','dyz',TO_TIMESTAMP('1/9/20 15:16:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/11/20 14:14:00','MM/DD/YY hh:mi:ss')),
    (123,'abc','dyz',TO_TIMESTAMP('1/11/20 14:14:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/14/20 09:14:00','MM/DD/YY hh:mi:ss')),
    (123,'abc','dyz',TO_TIMESTAMP('1/14/20 09:14:00','MM/DD/YY hh:mi:ss'),null),
    (456,'ghi','jkl',TO_TIMESTAMP('1/2/20 03:45:00','MM/DD/YY hh:mi:ss'),TO_TIMESTAMP('1/10/20 00:00:00','MM/DD/YY hh:mi:ss')),
    (456,'lmn','opq',TO_TIMESTAMP('1/10/20 00:00:00','MM/DD/YY hh:mi:ss'),null)
);
From there, I create "groups" using the row-number-difference trick (often called gaps-and-islands), which I hope will stand up to all of your use cases:
SELECT DISTINCT
       PK
     , check_cols
     -- take each island's first start_ts and last end_ts; Snowflake's default
     -- frame for FIRST_VALUE/LAST_VALUE spans the whole partition
     , FIRST_VALUE(start_ts) OVER (PARTITION BY PK, check_cols, group_num ORDER BY start_ts) AS new_start_ts
     , LAST_VALUE(end_ts)    OVER (PARTITION BY PK, check_cols, group_num ORDER BY start_ts) AS new_end_ts
FROM (
    SELECT
         -- the two row numbers drift apart whenever check_cols changes, so
         -- their difference is constant within each run of identical values
         ROW_NUMBER() OVER (PARTITION BY PK, check_cols ORDER BY start_ts) AS group_cnt
       , group_cnt - pk_row AS group_num
       , *
    FROM (
        SELECT
             ROW_NUMBER() OVER (PARTITION BY PK ORDER BY start_ts) AS pk_row
           , *
        FROM tran_data
    )
)
ORDER BY pk, new_start_ts;
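(Why the row-number difference works may be easier to see in a minimal sketch of the same arithmetic; this is an illustration with hypothetical in-memory rows, not a substitute for the SQL above. Within a run of identical check_cols, both counters advance by one per row, so their difference stays constant; when check_cols reverts to an earlier value, the per-(PK, check_cols) counter resumes where it left off while the per-PK counter has moved on, so the difference takes a new value and the reverted run lands in a fresh group:)

```python
from collections import defaultdict

# (pk, check_cols) pairs, already sorted by pk then start_ts.
rows = [
    (123, "abc"), (123, "abc"), (123, "abc"),
    (123, "abd"), (123, "abd"),
    (123, "abc"), (123, "abc"),
    (456, "ghi"), (456, "lmn"),
]

def island_ids(rows):
    """For each row, emit (pk, check_cols, group_num) where group_num mimics
    ROW_NUMBER() OVER (pk, check_cols) - ROW_NUMBER() OVER (pk)."""
    pk_row = defaultdict(int)    # per-PK counter
    grp_cnt = defaultdict(int)   # per-(PK, check_cols) counter
    out = []
    for pk, cc in rows:
        pk_row[pk] += 1
        grp_cnt[(pk, cc)] += 1
        out.append((pk, cc, grp_cnt[(pk, cc)] - pk_row[pk]))
    return out

for row in island_ids(rows):
    print(row)
```

Grouping the distinct (pk, check_cols, group_num) triples then yields exactly one group per island, including the separate group for PK 123's second "abc" run.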
Upvotes: 2