I need to find an efficient way of creating a query that reports the deltas of an aggregation, with the start and end dates for each value.
Requirements
What I've Tried
I tried creating a CTE generating all possible ranges for a category and then joining it back to the main query, so that a sub-category spanning multiple ranges was broken up. I then grouped by the ranges and took MAX(is_active).
While this was a good start (all that remained was to combine consecutive ranges with the same value), the query was extraordinarily slow. I'm not as familiar with Postgres as I am with other flavors of SQL, and decided my time was better spent reaching out and getting help from someone more experienced.
Source Data
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt | end_dt | cat_id | sub_cat_id | is_active | comment |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1 | 2018-01-01 | 2018-01-31 | 1 | 1001 | 1 | (null) |
| 2 | 2018-02-01 | 2018-02-14 | 1 | 1001 | 0 | (null) |
| 3 | 2018-02-15 | 2018-02-28 | 1 | 1001 | 0 | cat 1 is_active is unchanged despite new record. |
| 4 | 2018-03-01 | 2018-03-30 | 1 | 1001 | 1 | (null) |
| 5 | 2018-01-01 | 2018-01-15 | 2 | 2001 | 1 | (null) |
| 6 | 2018-01-01 | 2018-01-31 | 2 | 2002 | 1 | (null) |
| 7 | 2018-01-15 | 2018-02-10 | 2 | 2001 | 0 | cat 2 should still be active until 2002 is inactive |
| 8 | 2018-02-01 | 2018-02-14 | 2 | 2002 | 0 | cat 2 is inactive |
| 9 | 2018-02-10 | 2018-03-15 | 2 | 2001 | 0 | this record will cause trouble |
| 10 | 2018-02-15 | 2018-03-30 | 2 | 2002 | 1 | cat 2 should be active again |
| 11 | 2018-03-15 | 2018-03-30 | 2 | 2001 | 1 | cat 2 is_active is unchanged despite new record. |
| 12 | 2018-04-01 | 2018-04-30 | 2 | 2001 | 0 | cat 2 ends in a zero |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
Expected Result
+------------+------------+--------+-----------+
| start_dt | end_dt | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1 | 1 |
| 2018-02-01 | 2018-02-28 | 1 | 0 |
| 2018-03-01 | 2018-03-30 | 1 | 1 |
| 2018-01-01 | 2018-01-31 | 2 | 1 |
| 2018-02-01 | 2018-02-14 | 2 | 0 |
| 2018-02-15 | 2018-03-30 | 2 | 1 |
| 2018-04-01 | 2018-04-30 | 2 | 0 |
+------------+------------+--------+-----------+
Here's a select statement to help you write your own tests.
SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )
Upvotes: 4
Views: 152
WITH test AS (
SELECT id, start_dt::date, end_dt::date, cat_id, sub_cat_id, is_active::int, comment FROM ( VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) test (id, start_dt, end_dt, cat_id, sub_cat_id, is_active, comment)
)
SELECT cat_id, start_date, end_date, active_state
FROM (
SELECT cat_id, date as start_date, lead(date-1) over w as end_date
, active_state, prev_active
, nonactive_state, prev_nonactive
FROM (
SELECT cat_id, date
, active_state, prev_active
, nonactive_state
, lag(nonactive_state, 1, 0) over w as prev_nonactive
FROM (
SELECT cat_id, date, active_state, lag(active_state, 1, 0) over w as prev_active
, (nonactive_state > active_state)::int as nonactive_state
FROM (
SELECT DISTINCT ON (cat_id, date)
cat_id, date
, (CASE WHEN sum(type) over w > 0 THEN 1 ELSE 0 END) as active_state
, (CASE WHEN sum(nonactive_type) over w > 0 THEN 1 ELSE 0 END) as nonactive_state
FROM (
SELECT start_dt as date
, 1 as type
, cat_id
, 0 as nonactive_type
FROM test
WHERE is_active = 1
UNION ALL
SELECT end_dt + 1 as date
, -1 as type
, cat_id
, 0 as nonactive_type
FROM test
WHERE is_active = 1
UNION ALL
SELECT start_dt as date
, 0 as type
, cat_id
, 1 as nonactive_type
FROM test
WHERE is_active = 0
UNION ALL
SELECT end_dt + 1 as date
, 0 as type
, cat_id
, -1 as nonactive_type
FROM test
WHERE is_active = 0
) t
WINDOW w as (partition by cat_id order by date)
ORDER BY cat_id, date
) t2
WINDOW w as (partition by cat_id order by date)
) t3
WINDOW w as (partition by cat_id order by date)
) t4
WHERE (active_state != prev_active) OR (nonactive_state != prev_nonactive)
WINDOW w as (partition by cat_id order by date)
) t5
WHERE active_state = 1 OR nonactive_state = 1
ORDER BY cat_id, start_date
yields
| cat_id | start_date | end_date | active_state |
|--------+------------+------------+--------------|
| 1 | 2018-01-01 | 2018-01-31 | 1 |
| 1 | 2018-02-01 | 2018-02-28 | 0 |
| 1 | 2018-03-01 | 2018-03-30 | 1 |
| 2 | 2018-01-01 | 2018-01-31 | 1 |
| 2 | 2018-02-01 | 2018-02-14 | 0 |
| 2 | 2018-02-15 | 2018-03-30 | 1 |
| 2 | 2018-04-01 | 2018-04-30 | 0 |
This combines the start_dt and end_dt dates into a single column and introduces a type column which is 1 for start dates and -1 for end dates. Summing over type produces a running value which is positive when the corresponding date is inside a [start_dt, end_dt] interval, and which is 0 otherwise. This is one of the ideas presented in Itzik Ben-Gan's Packing Intervals, but I first learned it from DSM (in the context of programming in Python/Pandas) here.
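The +1/-1 sweep is easy to prototype outside SQL. Here is a minimal Python sketch of the same idea (illustrative only, not the query itself; intervals are inclusive date pairs, as in the sample data):

```python
from datetime import date, timedelta
from itertools import groupby

def pack_intervals(intervals):
    """Merge inclusive [start, end] date intervals using a +1/-1 sweep.

    Each start contributes +1 and each end+1 contributes -1; wherever the
    running sum is positive, the date lies inside some interval.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))                     # interval opens
        events.append((end + timedelta(days=1), -1))  # first day after it closes
    events.sort()

    packed, running, open_start = [], 0, None
    for day, deltas in groupby(events, key=lambda e: e[0]):
        running += sum(d for _, d in deltas)
        if running > 0 and open_start is None:
            open_start = day                          # coverage begins
        elif running == 0 and open_start is not None:
            packed.append((open_start, day - timedelta(days=1)))
            open_start = None                         # coverage ends
    return packed

# cat 1 rows with is_active = 1 (ids 1 and 4 from the sample data)
active = [(date(2018, 1, 1), date(2018, 1, 31)),
          (date(2018, 3, 1), date(2018, 3, 30))]
print(pack_intervals(active))  # the two stretches stay separate (they don't touch)
```

Note that overlapping and merely adjacent intervals both collapse into one packed interval, because the running sum never returns to 0 between them.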
Usually when dealing with intervals using this technique, the intervals define when dates are in an "on" state, and not being "on" automatically implies "off". In this problem, however, rows where is_active = 1 imply the final active_state is "on", but dates outside those intervals are not necessarily "off": 2018-03-31, for example, lies outside every is_active = 1 interval yet is not "off". Similarly, rows where is_active = 0 imply the final active_state is "off" only so long as the dates do not intersect an interval whose is_active = 1.
To handle these two different kinds of intervals, I applied the above technique (summing +1/-1 types) twice: once for rows where is_active = 1 and once for rows where is_active = 0. This gives us a handle on dates which are definitely in an active_state ("on") and dates which are definitely in a nonactive_state ("off"). Since active trumps nonactive, the dates deemed nonactive are trimmed using:
(nonactive_state > active_state)::int as nonactive_state
(That is, when active_state = 1 and nonactive_state = 1, the expression above changes nonactive_state to 0.)
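To make the trimming rule concrete, here is an illustrative brute-force Python version (day-by-day, so only suitable for short ranges, unlike the window-function query): a day is 1 if any is_active = 1 interval covers it, 0 if it is covered only by is_active = 0 intervals, and undefined otherwise.

```python
from datetime import date, timedelta

# cat 2 rows from the sample data: (start, end, is_active)
rows = [
    (date(2018, 1, 1),  date(2018, 1, 15), 1),
    (date(2018, 1, 1),  date(2018, 1, 31), 1),
    (date(2018, 1, 15), date(2018, 2, 10), 0),
    (date(2018, 2, 1),  date(2018, 2, 14), 0),
    (date(2018, 2, 10), date(2018, 3, 15), 0),
    (date(2018, 2, 15), date(2018, 3, 30), 1),
    (date(2018, 3, 15), date(2018, 3, 30), 1),
    (date(2018, 4, 1),  date(2018, 4, 30), 0),
]

def daily_state(rows):
    """Per-day state: 1 if any active interval covers the day,
    0 if only inactive intervals cover it, None if no interval does."""
    lo = min(r[0] for r in rows)
    hi = max(r[1] for r in rows)
    out, day = [], lo
    while day <= hi:
        active = any(s <= day <= e for s, e, f in rows if f == 1)
        covered = any(s <= day <= e for s, e, f in rows)
        out.append((day, 1 if active else (0 if covered else None)))
        day += timedelta(days=1)
    return out

def to_intervals(days):
    """Collapse consecutive days with the same state back into intervals."""
    result = []
    for day, state in days:
        if result and result[-1][2] == state and result[-1][1] + timedelta(days=1) == day:
            result[-1] = (result[-1][0], day, state)
        else:
            result.append((day, day, state))
    return [r for r in result if r[2] is not None]  # drop undefined stretches

for start, end, state in to_intervals(daily_state(rows)):
    print(start, end, state)
# 2018-01-01 2018-01-31 1
# 2018-02-01 2018-02-14 0
# 2018-02-15 2018-03-30 1
# 2018-04-01 2018-04-30 0
```

Note how 2018-03-31 comes out as undefined (no interval covers it) and is dropped rather than reported as inactive, which matches the expected result for cat 2.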
Upvotes: 1
So, a given date is active if at least one subcategory is active on that date; if no subcategory is active on a given date, that date is inactive. This logic wasn't clear to me from the original question at first.
I mentioned Itzik Ben-Gan's article Packing Intervals, which is one way to approach this.
With this approach you can pack all active intervals while ignoring inactive intervals completely; the gaps that are left after packing the active intervals are inactive.
If you never have dates that are neither active nor inactive, this is the final answer. If such "undefined" dates can occur, things become trickier.
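Concretely, once the active intervals are packed, the gaps between consecutive packed intervals are the inactive stretches. A small Python sketch (illustrative only, using cat 1's packed active intervals from the expected result):

```python
from datetime import date, timedelta

# already-packed active intervals for cat 1 (from the expected result)
active = [(date(2018, 1, 1), date(2018, 1, 31)),
          (date(2018, 3, 1), date(2018, 3, 30))]

def gaps(intervals):
    """Gaps between consecutive packed intervals; when there are no
    'undefined' dates, these are exactly the inactive stretches."""
    out = []
    for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
        if e1 + timedelta(days=1) < s2:  # a real gap, not mere adjacency
            out.append((e1 + timedelta(days=1), s2 - timedelta(days=1)))
    return out

print(gaps(active))  # the inactive stretch: 2018-02-01 .. 2018-02-28
```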
A totally different approach is to use a calendar table (a permanent table, or a series of dates generated on the fly). Join each row of the original table to the calendar table to expand it and make one row for each date in the given interval.
Then group them all by Category and Date and set is_active flag as MAX (if at least one subcategory has is_active=1 for that date, the MAX will be 1, i.e. active as well).
This approach is much easier to understand and should work reasonably well as long as the intervals are not too long.
Something like this:
SELECT
Calendar.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN Calendar
ON Calendar.dt >= src.start_dt
AND Calendar.dt <= src.end_dt
GROUP BY
Calendar.dt
,src.cat_id
So, you'll get one row per date and category. Now you need to merge consecutive dates back into intervals. You can use the Packing Intervals method again or some simpler variation of gaps-and-islands.
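The row-number difference trick is worth seeing in isolation. A small Python sketch of the same grouping (illustrative; `days` stands in for the per-day rows the calendar join produces for one category):

```python
from datetime import date
from itertools import groupby

# one row per day with its resolved is_active flag, as produced by the
# calendar-join step; hypothetical single-category sample
days = [
    (date(2018, 2, 1), 1), (date(2018, 2, 2), 1),
    (date(2018, 2, 3), 0), (date(2018, 2, 4), 0), (date(2018, 2, 5), 0),
    (date(2018, 2, 6), 1),
]

def islands(days):
    """Group consecutive days sharing a flag, using the row-number trick:
    rn1 (position overall) minus rn2 (position among rows with the same
    flag) is constant exactly while the flag doesn't change."""
    per_flag = {}  # rn2 counters, one per flag value
    keyed = []
    for rn1, (day, flag) in enumerate(days):
        per_flag[flag] = per_flag.get(flag, 0) + 1
        keyed.append((flag, rn1 - per_flag[flag], day))  # (flag, rn1-rn2, day)
    out = []
    for _, grp in groupby(keyed, key=lambda k: (k[0], k[1])):
        grp = list(grp)
        out.append((grp[0][2], grp[-1][2], grp[0][0]))   # (start, end, flag)
    return out

print(islands(days))
# [(2018-02-01, 2018-02-02, 1), (2018-02-03, 2018-02-05, 0), (2018-02-06, 2018-02-06, 1)]
```

MIN(dt) and MAX(dt) in the GROUP BY query play the role of `grp[0][2]` and `grp[-1][2]` here.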
Sample data
WITH src AS
(
SELECT id,start_dt::date start_dt,end_dt::date end_dt,cat_id,sub_cat_id,is_active,comment
FROM (VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, 1, null),
(2, '2018-02-01', '2018-02-14', 1, 1001, 0, null),
(3, '2018-02-15', '2018-02-28', 1, 1001, 0, 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, 1, null),
(5, '2018-01-01', '2018-01-15', 2, 2001, 1, null),
(6, '2018-01-01', '2018-01-31', 2, 2002, 1, null),
(7, '2018-01-15', '2018-02-10', 2, 2001, 0, 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, 0, 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, 0, 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, 1, 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, 1, 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, 0, 'cat 2 ends in 0.')
) src ( id,start_dt,end_dt,cat_id,sub_cat_id,is_active,comment)
)
,Calendar AS
(
-- OP Note: Union of all dates from source produced 30% faster results.
-- OP Note 2: Including the cat_id (which was an indexed FK) made the query 8x faster.
SELECT cat_id, start_dt dt FROM src
UNION SELECT cat_id, end_dt dt FROM src
/*SELECT dt::date dt
FROM (
SELECT MIN(start_dt) min_start, MAX(end_dt) max_end
FROM src
) max_ranges
CROSS JOIN generate_series(min_start, max_end, '1 day'::interval) dt*/
)
The main query
Examine result of each intermediate CTE to fully understand how it works.
-- expand intervals into individual dates
,CTE_Dates
AS
(
SELECT
Calendar.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN Calendar
ON Calendar.dt >= src.start_dt
AND Calendar.dt <= src.end_dt
AND Calendar.cat_id = src.cat_id
GROUP BY
Calendar.dt
,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
MIN(dt) AS start_dt
,MAX(dt) AS end_dt
,cat_id
,is_active
FROM CTE_rn
GROUP BY
cat_id
,is_active
,rn1 - rn2
ORDER BY
cat_id
,start_dt
;
Second variant without generic Calendar
It may perform better: this variant doesn't have to scan the src table (twice) to build a temporary list of dates, doesn't sort that list to remove duplicates, and doesn't join to that temporary list of dates (which most likely has no supporting index).
But, it generates more rows.
-- remove Calendar CTE above,
-- use generate_series() to generate the exact range of dates we need
-- without joining to generic Calendar table
-- expand intervals into individual dates
,CTE_Dates
AS
(
SELECT
Dates.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN LATERAL
(
SELECT dt::date
FROM generate_series(src.start_dt, src.end_dt, '1 day'::interval) AS s(dt)
) AS Dates ON true
GROUP BY
Dates.dt
,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
MIN(dt) AS start_dt
,MAX(dt) AS end_dt
,cat_id
,is_active
FROM CTE_rn
GROUP BY
cat_id
,is_active
,rn1 - rn2
ORDER BY
cat_id
,start_dt
;
Result
+------------+------------+--------+-----------+
| start_dt | end_dt | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1 | 1 |
| 2018-02-01 | 2018-02-28 | 1 | 0 |
| 2018-03-01 | 2018-03-30 | 1 | 1 |
| 2018-01-01 | 2018-01-31 | 2 | 1 |
| 2018-02-01 | 2018-02-14 | 2 | 0 |
| 2018-02-15 | 2018-03-30 | 2 | 1 |
| 2018-04-01 | 2018-04-30 | 2 | 0 |
+------------+------------+--------+-----------+
Also, it is known that a CTE is an "optimisation barrier" in Postgres, so if you inline these CTEs into a single query its performance may change. (In PostgreSQL 12 and later, simple non-recursive CTEs that are referenced only once are inlined by default, and the behaviour can be controlled with the MATERIALIZED / NOT MATERIALIZED keywords.) You need to test on your system with your data.
Upvotes: 1