Reputation: 498
I'm building a data warehouse with multiple layers storing the same data. All of the data in one of the middle layers is versioned with start and end dates, as if it was a type 2 slowly changing dimension. The problem occurs when I query these tables. There are typically more columns in the table than in the query, so adjacent versions in the query have different start and end dates, but are otherwise identical. I want to combine these versions, to show dates when columns in the query change, not when rows in the table change.
I have some SQL that almost works:
create table versions
(id int
, name varchar(100) Not null
, RowStartDate datetime Not null
, RowEndDate datetime Not null
, primary key (id,RowStartDate)
, check (RowStartDate < RowEndDate));
insert into versions values
(1,'A','2014-01-01','9999-12-31')
,(2,'B','2014-01-01','2014-12-31')
,(2,'B','2014-12-31','9999-12-31')
,(3,'C','2014-01-01','2014-12-31')
,(3,'CC','2014-12-31','2015-12-31')
,(3,'CC','2015-12-31','9999-12-31')
,(4,'D','2014-01-01','2014-12-31')
,(4,'DD','2014-12-31','2015-12-31')
,(4,'DD','2015-12-31','2016-12-31')
,(4,'D','2016-12-31','9999-12-31')
,(5,'E','2014-01-01','2014-12-31')
,(5,'E','2014-12-31','2015-12-31')
,(5,'E','2015-12-31','2016-12-31')
,(5,'E','2016-12-31','2017-12-31')
,(5,'E','2017-12-31','9999-12-31')
;
WITH CTE_detect_duplicates AS (SELECT [id]
,[name]
,[RowStartDate]
,[RowEndDate]
,LAST_VALUE(RowEndDate) OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) as LastEndDate
,rank() OVER (PARTITION BY id, name ORDER BY RowStartDate, RowEndDate) as duplicateNumber
FROM versions
)
SELECT [id]
,[name]
,[RowStartDate]
,LastEndDate as RowEndDate
FROM CTE_detect_duplicates
WHERE duplicateNumber = 1
The problem here is that it returns two rows for id "4", when three are desired. Actual:
id name RowStartDate RowEndDate 4 D 2014-01-01 00:00:00.000 9999-12-31 00:00:00.000 4 DD 2014-12-31 00:00:00.000 2016-12-31 00:00:00.000Desired:
id name RowStartDate RowEndDate 4 D 2014-01-01 00:00:00.000 2014-12-31 00:00:00.000 4 DD 2014-12-31 00:00:00.000 2016-12-31 00:00:00.000 4 D 2016-12-31 00:00:00.000 9999-12-31 00:00:00.000The value D is not correct for the period when the value DD is correct, so the version dates are incorrect on the first row (4,'D') from the my query.
I want to be able to remove these duplicates in pure SQL, or an in-line table valued function (I have a generator that creates multi-statement table valued functions that do this, but the resulting functions perform poorly). Does anyone have any ideas?
Upvotes: 0
Views: 2812
Reputation: 5669
The following query, containing multiple CTE's compresses the date ranges of the updates and removes duplicate values.
1 First ranks are assigned within each id group, based on the RowStartDate.
2 Next, the maximum rank (next_rank_no
) of the range of ranks which has the same value for NAME
is determined. Thus, for the example data, row 1 of id=5 would have next_rank_no=5 and row 2 of id=4 would have next_rank_no=3. This version only handles the NAME
column. If you want to handle additional columns, they must be included in the condition as well. For example, if you want to include a LOCATION
column, then the join conditions would read as:
left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name and sv2.location = sv1.location
left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and (sv3.name <> sv1.name or sv3.location <> sv1.location)
3 Finally, the first row for each id is selected. Then, the row corresponding to the next_rank_no
is selected in a recursive fashion.
with sorted_versions as --ranks are assigned within each id group
(
select
v1.id,
v1.name,
v1.RowStartDate,
v1.RowEndDate,
rank() over (partition by v1.id order by v1.RowStartDate) rank_no
from versions v1
left join versions v2 on (v1.id = v2.id and v2.RowStartDate = v1.RowEndDate)
),
next_rank as --the maximum rank of the range of ranks which has the same value for NAME
(
select
sv1.id id, sv1.rank_no rank_no, COALESCE(min(sv3.rank_no)-1 , COALESCE(max(sv2.rank_no), sv1.rank_no)) next_rank_no
from sorted_versions sv1
left join sorted_versions sv2 on sv2.id = sv1.id and sv2.rank_no > sv1.rank_no and sv2.name = sv1.name
left join sorted_versions sv3 on sv3.id = sv1.id and sv3.rank_no > sv1.rank_no and sv3.name <> sv1.name
group by sv1.id, sv1.rank_no
),
versions_cte as --the rowenddate of the "maximum rank" is selected
(
select sv.id, sv.name, sv.rowstartdate, sv3.rowenddate, nr.next_rank_no rank_no
from sorted_versions sv
inner join next_rank nr on sv.id = nr.id and sv.rank_no = nr.rank_no and sv.rank_no = 1
inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no
union all
select
sv2.id,
sv2.name,
sv2.rowstartdate,
sv3.rowenddate,
nr.next_rank_no
from versions_cte vc
inner join sorted_versions sv2 on sv2.id = vc.id and sv2.rank_no = vc.rank_no + 1
inner join next_rank nr on sv2.id = nr.id and sv2.rank_no = nr.rank_no
inner join sorted_versions sv3 on nr.id = sv3.id and nr.next_rank_no = sv3.rank_no
)
select id, name, rowstartdate, rowenddate
from versions_cte
order by id, rowstartdate;
Upvotes: 2