Calculate minimum set of record versions from a subset of columns

Question

I am trying to crack an SQL problem that seems very simple to explain:

I have a table with several versions of a given entity (like an SCD type 2 dimension) - the table temp.test below.

Content:

DROP TABLE IF EXISTS temp.test;
CREATE TEMP TABLE temp.test (
  row_id   INTEGER IDENTITY (1, 1),
  id       VARCHAR,
  start_ts TIMESTAMP,
  end_ts   TIMESTAMP,
  level1   VARCHAR,
  level2   VARCHAR
);

INSERT INTO temp.test (id, start_ts, end_ts, level1, level2) VALUES
  ('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1', 'ABC2'),
  ('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1', 'DEF2'),
  ('a', '2019-01-01 00:00:00.000000', '2019-12-31 23:59:59.999999', 'ABC1', 'GHI2'),
  ('a', '2020-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1', 'JKL2');

I basically want to end up with:

-- Desired output
('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1'),
('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1'),
('a', '2019-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1'),

Meaning, I want the minimum set of versions of column level1. Note that the 3rd and 4th row would be duplicated, but in that case, we are getting the min(start_ts) and max(end_ts) to calculate the version.

Here's what I tried, but I failed miserably...

-- Wrong
SELECT
  id,
  min(start_ts) AS start_ts,
  max(end_ts)   AS end_ts,
  level1
FROM temp.test
GROUP BY id, level1
ORDER BY 2;

-- Wrong
SELECT DISTINCT
  id,
  FIRST_VALUE(start_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts) AS start_ts,
  LAST_VALUE(end_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts)    AS end_ts,
  level1
FROM temp.test
ORDER BY 2;

There has to be some magical way of having the output I need. What are your suggestions?

Note: I am using Snowflake, but this is just standard SQL.

Gordon Linoff · Accepted Answer

This is a gaps-and-islands problem. In this case, I would use the row_number() approach:

SELECT id, level1,
       MIN(start_ts) as start_ts, MAX(end_ts) as end_ts
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_ts) as seqnum_i,
             ROW_NUMBER() OVER (PARTITION BY id, level1 ORDER BY start_ts) as seqnum_il,
      FROM temp.test t
     ) t
GROUP BY id, level1, (seqnum_i - seqnum_il);

Note that this assumes that there are no gaps in the start and end timestamps.

How this works is not immediately obvious. I generally recommend that you just stare at the results of the subquery. It is generally pretty obvious that the difference between the two row numbers identifies the groups that you want to aggregate.

Calculate minimum set of record versions from a subset of columns

Answers (1)

Related Questions