I need to find an efficient way of creating a query that reports the deltas of an aggregation, with the start and end dates for each value.
Requirements
What I've Tried
I tried creating a CTE generating all possible ranges for a category and then joining it back to the main query, so that a sub-category spanning multiple ranges was broken up. I then grouped by the ranges and took MAX(is_active).
While this was a good start (all that remained was to combine consecutive ranges with the same value), the query was extraordinarily slow. I'm not as familiar with Postgres as I am with other flavors of SQL, and decided my time was better spent reaching out and getting help from someone more experienced.
Source Data
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt | end_dt | cat_id | sub_cat_id | is_active | comment |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1 | 2018-01-01 | 2018-01-31 | 1 | 1001 | 1 | (null) |
| 2 | 2018-02-01 | 2018-02-14 | 1 | 1001 | 0 | (null) |
| 3 | 2018-02-15 | 2018-02-28 | 1 | 1001 | 0 | cat 1 is_active is unchanged despite new record. |
| 4 | 2018-03-01 | 2018-03-30 | 1 | 1001 | 1 | (null) |
| 5 | 2018-01-01 | 2018-01-15 | 2 | 2001 | 1 | (null) |
| 6 | 2018-01-01 | 2018-01-31 | 2 | 2002 | 1 | (null) |
| 7 | 2018-01-15 | 2018-02-10 | 2 | 2001 | 0 | cat 2 should still be active until 2002 is inactive |
| 8 | 2018-02-01 | 2018-02-14 | 2 | 2002 | 0 | cat 2 is inactive |
| 9 | 2018-02-10 | 2018-03-15 | 2 | 2001 | 0 | this record will cause trouble |
| 10 | 2018-02-15 | 2018-03-30 | 2 | 2002 | 1 | cat 2 should be active again |
| 11 | 2018-03-15 | 2018-03-30 | 2 | 2001 | 1 | cat 2 is_active is unchanged despite new record. |
| 12 | 2018-04-01 | 2018-04-30 | 2 | 2001 | 0 | cat 2 ends in a zero |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
Expected Result
+------------+------------+--------+-----------+
| start_dt | end_dt | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1 | 1 |
| 2018-02-01 | 2018-02-28 | 1 | 0 |
| 2018-03-01 | 2018-03-30 | 1 | 1 |
| 2018-01-01 | 2018-01-31 | 2 | 1 |
| 2018-02-01 | 2018-02-14 | 2 | 0 |
| 2018-02-15 | 2018-03-30 | 2 | 1 |
| 2018-04-01 | 2018-04-30 | 2 | 0 |
+------------+------------+--------+-----------+
Here's a select statement to help you write your own tests.
SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )
Upvotes: 4
Views: 152
WITH test AS (
SELECT id, start_dt::date, end_dt::date, cat_id, sub_cat_id, is_active::int, comment FROM ( VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) test (id, start_dt, end_dt, cat_id, sub_cat_id, is_active, comment)
)
SELECT cat_id, start_date, end_date, active_state
FROM (
SELECT cat_id, date as start_date, lead(date-1) over w as end_date
, active_state, prev_active
, nonactive_state, prev_nonactive
FROM (
SELECT cat_id, date
, active_state, prev_active
, nonactive_state
, lag(nonactive_state, 1, 0) over w as prev_nonactive
FROM (
SELECT cat_id, date, active_state, lag(active_state, 1, 0) over w as prev_active
, (nonactive_state > active_state)::int as nonactive_state
FROM (
SELECT DISTINCT ON (cat_id, date)
cat_id, date
, (CASE WHEN sum(type) over w > 0 THEN 1 ELSE 0 END) as active_state
, (CASE WHEN sum(nonactive_type) over w > 0 THEN 1 ELSE 0 END) as nonactive_state
FROM (
SELECT start_dt as date
, 1 as type
, cat_id
, 0 as nonactive_type
FROM test
WHERE is_active = 1
UNION ALL
SELECT end_dt + 1 as date
, -1 as type
, cat_id
, 0 as nonactive_type
FROM test
WHERE is_active = 1
UNION ALL
SELECT start_dt as date
, 0 as type
, cat_id
, 1 as nonactive_type
FROM test
WHERE is_active = 0
UNION ALL
SELECT end_dt + 1 as date
, 0 as type
, cat_id
, -1 as nonactive_type
FROM test
WHERE is_active = 0
) t
WINDOW w as (partition by cat_id order by date)
ORDER BY cat_id, date
) t2
WINDOW w as (partition by cat_id order by date)
) t3
WINDOW w as (partition by cat_id order by date)
) t4
WHERE (active_state != prev_active) OR (nonactive_state != prev_nonactive)
WINDOW w as (partition by cat_id order by date)
) t5
WHERE active_state = 1 OR nonactive_state = 1
ORDER BY cat_id, start_date
yields
| cat_id | start_date | end_date | active_state |
|--------+------------+------------+--------------|
| 1 | 2018-01-01 | 2018-01-31 | 1 |
| 1 | 2018-02-01 | 2018-02-28 | 0 |
| 1 | 2018-03-01 | 2018-03-30 | 1 |
| 2 | 2018-01-01 | 2018-01-31 | 1 |
| 2 | 2018-02-01 | 2018-02-14 | 0 |
| 2 | 2018-02-15 | 2018-03-30 | 1 |
| 2 | 2018-04-01 | 2018-04-30 | 0 |
This combines the start_dt and end_dt dates into a single column and introduces a type column which is 1 for start dates and -1 for end dates. Summing over type produces a running value which is positive when the corresponding date is inside a [start_dt, end_dt] interval, and which is 0 otherwise. This is one of the ideas presented in Itzik Ben-Gan's Packing Intervals, but I first learned it from DSM (in the context of programming in Python/Pandas) here.
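The +1/-1 sweep is easy to prototype outside SQL. Here is a minimal Python sketch of the same idea (illustrative only, not the query itself; intervals are inclusive date pairs, as in the sample data):

```python
from datetime import date, timedelta
from itertools import groupby

def pack_intervals(intervals):
    """Merge inclusive [start, end] date intervals using a +1/-1 sweep.

    Each start contributes +1 and each end+1 contributes -1; wherever the
    running sum is positive, the date lies inside some interval.
    """
    events = []
    for start, end in intervals:
        events.append((start, 1))                     # interval opens
        events.append((end + timedelta(days=1), -1))  # first day after it closes
    events.sort()

    packed, running, open_start = [], 0, None
    for day, deltas in groupby(events, key=lambda e: e[0]):
        running += sum(d for _, d in deltas)
        if running > 0 and open_start is None:
            open_start = day                          # coverage begins
        elif running == 0 and open_start is not None:
            packed.append((open_start, day - timedelta(days=1)))
            open_start = None                         # coverage ends
    return packed

# cat 1 rows with is_active = 1 (ids 1 and 4 from the sample data)
active = [(date(2018, 1, 1), date(2018, 1, 31)),
          (date(2018, 3, 1), date(2018, 3, 30))]
print(pack_intervals(active))  # the two stretches stay separate (they don't touch)
```

Note that overlapping and merely adjacent intervals both collapse into one packed interval, because the running sum never returns to 0 between them.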
Usually when dealing with intervals using this technique, the intervals define when dates are in an "on" state, and not being "on" automatically implies "off". In this problem, however, rows where is_active = 1 imply the final active_state is "on", but dates outside those intervals are not necessarily "off": 2018-03-31, for example, lies outside every is_active = 1 interval yet is not "off". Similarly, rows where is_active = 0 imply the final active_state is "off" only so long as the dates do not intersect an interval whose is_active = 1.
To handle these two different kinds of intervals, I applied the above technique (summing +1/-1 types) twice: once for rows where is_active = 1 and once for rows where is_active = 0. This gives us a handle on dates which are definitely in an active_state ("on") and dates which are definitely in a nonactive_state ("off"). Since active trumps nonactive, the dates deemed nonactive are trimmed using:
(nonactive_state > active_state)::int as nonactive_state
(That is, when active_state = 1 and nonactive_state = 1, the expression above changes nonactive_state to 0.)
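To make the trimming rule concrete, here is an illustrative brute-force Python version (day-by-day, so only suitable for short ranges, unlike the window-function query): a day is 1 if any is_active = 1 interval covers it, 0 if it is covered only by is_active = 0 intervals, and undefined otherwise.

```python
from datetime import date, timedelta

# cat 2 rows from the sample data: (start, end, is_active)
rows = [
    (date(2018, 1, 1),  date(2018, 1, 15), 1),
    (date(2018, 1, 1),  date(2018, 1, 31), 1),
    (date(2018, 1, 15), date(2018, 2, 10), 0),
    (date(2018, 2, 1),  date(2018, 2, 14), 0),
    (date(2018, 2, 10), date(2018, 3, 15), 0),
    (date(2018, 2, 15), date(2018, 3, 30), 1),
    (date(2018, 3, 15), date(2018, 3, 30), 1),
    (date(2018, 4, 1),  date(2018, 4, 30), 0),
]

def daily_state(rows):
    """Per-day state: 1 if any active interval covers the day,
    0 if only inactive intervals cover it, None if no interval does."""
    lo = min(r[0] for r in rows)
    hi = max(r[1] for r in rows)
    out, day = [], lo
    while day <= hi:
        active = any(s <= day <= e for s, e, f in rows if f == 1)
        covered = any(s <= day <= e for s, e, f in rows)
        out.append((day, 1 if active else (0 if covered else None)))
        day += timedelta(days=1)
    return out

def to_intervals(days):
    """Collapse consecutive days with the same state back into intervals."""
    result = []
    for day, state in days:
        if result and result[-1][2] == state and result[-1][1] + timedelta(days=1) == day:
            result[-1] = (result[-1][0], day, state)
        else:
            result.append((day, day, state))
    return [r for r in result if r[2] is not None]  # drop undefined stretches

for start, end, state in to_intervals(daily_state(rows)):
    print(start, end, state)
# 2018-01-01 2018-01-31 1
# 2018-02-01 2018-02-14 0
# 2018-02-15 2018-03-30 1
# 2018-04-01 2018-04-30 0
```

Note how 2018-03-31 comes out as undefined (no interval covers it) and is dropped rather than reported as inactive, which matches the expected result for cat 2.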
Upvotes: 1
So, a given date is active if at least one subcategory is active on that date; if no subcategory is active on a given date, that date is inactive. This logic wasn't clear to me from the original question at first.
I mentioned Itzik Ben-Gan's article Packing Intervals, which is one way to approach this.
With this approach you can pack all active intervals while ignoring inactive intervals completely; the gaps that are left after packing the active intervals are inactive.
If you never have dates that are neither active nor inactive, this is the final answer. If such "undefined" dates can occur, things become trickier.
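Concretely, once the active intervals are packed, the gaps between consecutive packed intervals are the inactive stretches. A small Python sketch (illustrative only, using cat 1's packed active intervals from the expected result):

```python
from datetime import date, timedelta

# already-packed active intervals for cat 1 (from the expected result)
active = [(date(2018, 1, 1), date(2018, 1, 31)),
          (date(2018, 3, 1), date(2018, 3, 30))]

def gaps(intervals):
    """Gaps between consecutive packed intervals; when there are no
    'undefined' dates, these are exactly the inactive stretches."""
    out = []
    for (s1, e1), (s2, e2) in zip(intervals, intervals[1:]):
        if e1 + timedelta(days=1) < s2:  # a real gap, not mere adjacency
            out.append((e1 + timedelta(days=1), s2 - timedelta(days=1)))
    return out

print(gaps(active))  # the inactive stretch: 2018-02-01 .. 2018-02-28
```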
A totally different approach is to use a calendar table (a permanent table, or a series of dates generated on the fly). Join each row of the original table to the calendar table to expand it and make one row for each date in the given interval.
Then group them all by Category and Date and set is_active flag as MAX (if at least one subcategory has is_active=1 for that date, the MAX will be 1, i.e. active as well).
This approach is much easier to understand and should work reasonably well as long as the intervals are not too long.
Something like this:
SELECT
Calendar.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN Calendar
ON Calendar.dt >= src.start_dt
AND Calendar.dt <= src.end_dt
GROUP BY
Calendar.dt
,src.cat_id
So, you'll get one row per date and category. Now you need to merge consecutive dates back into intervals. You can use the Packing Intervals method again or some simpler variation of gaps-and-islands.
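The row-number difference trick is worth seeing in isolation. A small Python sketch of the same grouping (illustrative; `days` stands in for the per-day rows the calendar join produces for one category):

```python
from datetime import date
from itertools import groupby

# one row per day with its resolved is_active flag, as produced by the
# calendar-join step; hypothetical single-category sample
days = [
    (date(2018, 2, 1), 1), (date(2018, 2, 2), 1),
    (date(2018, 2, 3), 0), (date(2018, 2, 4), 0), (date(2018, 2, 5), 0),
    (date(2018, 2, 6), 1),
]

def islands(days):
    """Group consecutive days sharing a flag, using the row-number trick:
    rn1 (position overall) minus rn2 (position among rows with the same
    flag) is constant exactly while the flag doesn't change."""
    per_flag = {}  # rn2 counters, one per flag value
    keyed = []
    for rn1, (day, flag) in enumerate(days):
        per_flag[flag] = per_flag.get(flag, 0) + 1
        keyed.append((flag, rn1 - per_flag[flag], day))  # (flag, rn1-rn2, day)
    out = []
    for _, grp in groupby(keyed, key=lambda k: (k[0], k[1])):
        grp = list(grp)
        out.append((grp[0][2], grp[-1][2], grp[0][0]))   # (start, end, flag)
    return out

print(islands(days))
# [(2018-02-01, 2018-02-02, 1), (2018-02-03, 2018-02-05, 0), (2018-02-06, 2018-02-06, 1)]
```

MIN(dt) and MAX(dt) in the GROUP BY query play the role of `grp[0][2]` and `grp[-1][2]` here.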
Sample data
WITH src AS
(
SELECT id,start_dt::date start_dt,end_dt::date end_dt,cat_id,sub_cat_id,is_active,comment
FROM (VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, 1, null),
(2, '2018-02-01', '2018-02-14', 1, 1001, 0, null),
(3, '2018-02-15', '2018-02-28', 1, 1001, 0, 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, 1, null),
(5, '2018-01-01', '2018-01-15', 2, 2001, 1, null),
(6, '2018-01-01', '2018-01-31', 2, 2002, 1, null),
(7, '2018-01-15', '2018-02-10', 2, 2001, 0, 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, 0, 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, 0, 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, 1, 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, 1, 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, 0, 'cat 2 ends in 0.')
) src ( id,start_dt,end_dt,cat_id,sub_cat_id,is_active,comment)
)
,Calendar AS
(
-- OP Note: Union of all dates from source produced 30% faster results.
-- OP Note 2: Including the cat_id (which was an indexed FK) made the query 8x faster.
SELECT cat_id, start_dt dt FROM src
UNION SELECT cat_id, end_dt dt FROM src
/*SELECT dt::date dt
FROM (
SELECT MIN(start_dt) min_start, MAX(end_dt) max_end
FROM src
) max_ranges
CROSS JOIN generate_series(min_start, max_end, '1 day'::interval) dt*/
)
The main query
Examine result of each intermediate CTE to fully understand how it works.
-- expand intervals into individual dates
,CTE_Dates
AS
(
SELECT
Calendar.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN Calendar
ON Calendar.dt >= src.start_dt
AND Calendar.dt <= src.end_dt
AND Calendar.cat_id = src.cat_id
GROUP BY
Calendar.dt
,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
MIN(dt) AS start_dt
,MAX(dt) AS end_dt
,cat_id
,is_active
FROM CTE_rn
GROUP BY
cat_id
,is_active
,rn1 - rn2
ORDER BY
cat_id
,start_dt
;
Second variant without generic Calendar
It may perform better: this variant doesn't have to scan the src table (twice) to build a temporary list of dates, doesn't sort that list to remove duplicates, and doesn't join to that temporary list of dates (which most likely has no supporting index).
But, it generates more rows.
-- remove Calendar CTE above,
-- use generate_series() to generate the exact range of dates we need
-- without joining to generic Calendar table
-- expand intervals into individual dates
,CTE_Dates
AS
(
SELECT
Dates.dt
,src.cat_id
,MAX(src.is_active) AS is_active
-- we don't even need to know sub_cat_id
FROM
src
INNER JOIN LATERAL
(
SELECT dt::date
FROM generate_series(src.start_dt, src.end_dt, '1 day'::interval) AS s(dt)
) AS Dates ON true
GROUP BY
Dates.dt
,src.cat_id
)
-- simple gaps-and-islands
,CTE_rn
AS
(
SELECT
*
,ROW_NUMBER() OVER (PARTITION BY cat_id ORDER BY dt) AS rn1
,ROW_NUMBER() OVER (PARTITION BY cat_id, is_active ORDER BY dt) AS rn2
FROM CTE_Dates
)
-- diff of row numbers gives us a group's "ID"
-- condense each island and gap back into interval using simple GROUP BY
SELECT
MIN(dt) AS start_dt
,MAX(dt) AS end_dt
,cat_id
,is_active
FROM CTE_rn
GROUP BY
cat_id
,is_active
,rn1 - rn2
ORDER BY
cat_id
,start_dt
;
Result
+------------+------------+--------+-----------+
| start_dt | end_dt | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1 | 1 |
| 2018-02-01 | 2018-02-28 | 1 | 0 |
| 2018-03-01 | 2018-03-30 | 1 | 1 |
| 2018-01-01 | 2018-01-31 | 2 | 1 |
| 2018-02-01 | 2018-02-14 | 2 | 0 |
| 2018-02-15 | 2018-03-30 | 2 | 1 |
| 2018-04-01 | 2018-04-30 | 2 | 0 |
+------------+------------+--------+-----------+
Also, it is known that a CTE is an "optimisation barrier" in Postgres, so if you inline these CTEs into a single query its performance may change. (In PostgreSQL 12 and later, simple non-recursive CTEs that are referenced only once are inlined by default, and the behaviour can be controlled with the MATERIALIZED / NOT MATERIALIZED keywords.) You need to test on your system with your data.
Upvotes: 1