anahnarciso
anahnarciso

Reputation: 407

Calculate minimum set of record versions from a subset of columns

I am trying to crack an SQL problem that seems very simple to explain:

  1. I have a table with several versions of a given entity (like an SCD type 2 dimension) - the table temp.test below.

Content:

DROP TABLE IF EXISTS temp.test;
CREATE TEMP TABLE temp.test (
  row_id   INTEGER IDENTITY (1, 1),
  id       VARCHAR,
  start_ts TIMESTAMP,
  end_ts   TIMESTAMP,
  level1   VARCHAR,
  level2   VARCHAR
);

INSERT INTO temp.test (id, start_ts, end_ts, level1, level2) VALUES
  ('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1', 'ABC2'),
  ('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1', 'DEF2'),
  ('a', '2019-01-01 00:00:00.000000', '2019-12-31 23:59:59.999999', 'ABC1', 'GHI2'),
  ('a', '2020-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1', 'JKL2');
  1. I basically want to end up with:

    -- Desired output
    ('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1'),
    ('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1'),
    ('a', '2019-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1'),
    
  2. Meaning, I want the minimum set of versions of column level1. Note that the 3rd and 4th row would be duplicated, but in that case, we are getting the min(start_ts) and max(end_ts) to calculate the version.

  3. Here's what I tried, but I failed miserably...

    -- Wrong
    SELECT
      id,
      min(start_ts) AS start_ts,
      max(end_ts)   AS end_ts,
      level1
    FROM temp.test
    GROUP BY id, level1
    ORDER BY 2;
    
    -- Wrong
    SELECT DISTINCT
      id,
      FIRST_VALUE(start_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts) AS start_ts,
      LAST_VALUE(end_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts)    AS end_ts,
      level1
    FROM temp.test
    ORDER BY 2;
    

There has to be some magical way of having the output I need. What are your suggestions?

Note: I am using Snowflake, but this is just standard SQL.

Upvotes: 2

Views: 204

Answers (1)

Gordon Linoff
Gordon Linoff

Reputation: 1270391

This is a gaps-and-islands problem. In this case, I would use the row_number() approach:

SELECT id, level1,
       MIN(start_ts) as start_ts, MAX(end_ts) as end_ts
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_ts) as seqnum_i,
             ROW_NUMBER() OVER (PARTITION BY id, level1 ORDER BY start_ts) as seqnum_il,
      FROM temp.test t
     ) t
GROUP BY id, level1, (seqnum_i - seqnum_il);

Note that this assumes that there are no gaps in the start and end timestamps.

How this works is not immediately obvious. I generally recommend that you just stare at the results of the subquery. It is generally pretty obvious that the difference between the two row numbers identifies the groups that you want to aggregate.

Upvotes: 2

Related Questions