Reputation: 407
I am trying to crack an SQL problem that seems very simple to explain:
temp.test
below.Content:
DROP TABLE IF EXISTS temp.test;
CREATE TEMP TABLE temp.test (
row_id INTEGER IDENTITY (1, 1),
id VARCHAR,
start_ts TIMESTAMP,
end_ts TIMESTAMP,
level1 VARCHAR,
level2 VARCHAR
);
INSERT INTO temp.test (id, start_ts, end_ts, level1, level2) VALUES
('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1', 'ABC2'),
('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1', 'DEF2'),
('a', '2019-01-01 00:00:00.000000', '2019-12-31 23:59:59.999999', 'ABC1', 'GHI2'),
('a', '2020-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1', 'JKL2');
I basically want to end up with:
-- Desired output
('a', '1970-01-01 00:00:00.000000', '2017-12-31 23:59:59.999999', 'ABC1'),
('a', '2018-01-01 00:00:00.000000', '2018-12-31 23:59:59.999999', 'DEF1'),
('a', '2019-01-01 00:00:00.000000', '2020-12-31 23:59:59.999999', 'ABC1'),
Meaning, I want the minimum set of versions of column level1
. Note that the 3rd and 4th row would be duplicated, but in that case, we are getting the min(start_ts)
and max(end_ts)
to calculate the version.
Here's what I tried, but I failed miserably...
-- Wrong
SELECT
id,
min(start_ts) AS start_ts,
max(end_ts) AS end_ts,
level1
FROM temp.test
GROUP BY id, level1
ORDER BY 2;
-- Wrong
SELECT DISTINCT
id,
FIRST_VALUE(start_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts) AS start_ts,
LAST_VALUE(end_ts) OVER(PARTITION BY id, level1 ORDER BY start_ts) AS end_ts,
level1
FROM temp.test
ORDER BY 2;
There has to be some magical way of having the output I need. What are your suggestions?
Note: I am using Snowflake, but this is just standard SQL.
Upvotes: 2
Views: 204
Reputation: 1270391
This is a gaps-and-islands problem. In this case, I would use the row_number()
approach:
SELECT id, level1,
MIN(start_ts) as start_ts, MAX(end_ts) as end_ts
FROM (SELECT t.*,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_ts) as seqnum_i,
ROW_NUMBER() OVER (PARTITION BY id, level1 ORDER BY start_ts) as seqnum_il,
FROM temp.test t
) t
GROUP BY id, level1, (seqnum_i - seqnum_il);
Note that this assumes that there are no gaps in the start and end timestamps.
How this works is not immediately obvious. I generally recommend that you just stare at the results of the subquery. It is generally pretty obvious that the difference between the two row numbers identifies the groups that you want to aggregate.
Upvotes: 2