ptlx24
ptlx24

Reputation: 33

SQL query to find gaps within a column of dates

I have a table with status and date for every day and I'm trying to find out when the statuses change and if there's gaps within each status change / how many days were of a certain status.

+------+----------------+--------+
| user | date           | status |
+----- +----------------+--------+
|    1 | 12/01/2017     | open   |
|    1 | 12/02/2017     | open   |
|    1 | 12/03/2017     | open   |
|    1 | 12/04/2017     | closed |
|    1 | 12/05/2017     | closed |
|    1 | 12/06/2017     | open   |
|    1 | 12/07/2017     | open   |
+------+----------------+--------+

Expected output:

+------+------------+----------------+-------- ----+------------+
| user | status     | days_in_status | min         | max        |
+----- +------------+----------------+-------------+------------+
|    1 | open       | 3              | 12/01/2017  | 12/03/2017 |
|    1 | closed     | 2              | 12/04/2017  | 12/05/2017 |
|    1 | open       | 2              | 12/06/2017  | 12/07/2017 |
+------+------ -----+----------------+-------------+-- ---------+

Upvotes: 3

Views: 2529

Answers (3)

Eric Leschinski
Eric Leschinski

Reputation: 153832

SQL query to find the width of date gaps within dates

Fully contained example borrowing from Gordon's answer above that shows island widths as well as island gaps, copy and paste into postgresql and run it.

drop table if exists foobar; 
CREATE TABLE foobar( tick text, date_val date ); 

insert into foobar values('XYZ', '2021-01-03');  --island 1 has width 2
insert into foobar values('XYZ', '2021-01-04');  --island 1
insert into foobar values('XYZ', '2021-05-09');  --island 2 has width 3
insert into foobar values('XYZ', '2021-05-10');  --island 2 
insert into foobar values('XYZ', '2021-05-11');  --island 2
insert into foobar values('XYZ', '2021-07-07');  --island 3 has width 4
insert into foobar values('XYZ', '2021-07-08');  --island 3
insert into foobar values('XYZ', '2021-07-09');  --island 3
insert into foobar values('XYZ', '2021-07-10');  --island 3 
insert into foobar values('XYZ', '2022-10-10');  --island 4 has width 1

select tick, island_width, min_val, max_val, 
       min_val - lag(max_val) over (order by max_val) as gap_width from  
( 
  select tick, count(*) as island_width, 
         min(date_val) min_val, max(date_val) max_val 
  from ( 
    select t.*, 
    row_number() over ( partition by tick order by date_val ) as seqnum 
    from foobar t where tick = 'XYZ' 
    ) t 
  group by tick, date_val - seqnum * interval '1 day' 
) t2 order by max_val desc

Prints:

┌──────┬──────────────┬────────────┬────────────┬───────────┐ 
│ tick │ island_width │  min_val   │  max_val   │ gap_width │ 
├──────┼──────────────┼────────────┼────────────┼───────────┤ 
│ XYZ  │            1 │ 2022-10-10 │ 2022-10-10 │       457 │ 
│ XYZ  │            4 │ 2021-07-07 │ 2021-07-10 │        57 │ 
│ XYZ  │            3 │ 2021-05-09 │ 2021-05-11 │       125 │ 
│ XYZ  │            2 │ 2021-01-03 │ 2021-01-04 │         ¤ │ 
└──────┴──────────────┴────────────┴────────────┴───────────┘ 

Details: In this data there are 4 islands. There are 3 gaps between the 4 islands. The last island has a gap. The first island has a null gap.

The SQL finds the days between every min_date and max_date by subtracting a day and decrementing rownumber until all dates are accounted for. partition somehow groups the islands by date and a 2 day moving window using over and lag with default lookbehind 1. Black Box Demon magic aside it goes fast enough and all unit tests pass.

enter image description here

With help from walkthrough here: https://bertwagner.com/posts/gaps-and-islands

Same as above but make gaps count hours rather than days:

drop table if exists foobar; 
CREATE TABLE foobar( tick text, date_val timestamp); 
insert into foobar values('XYZ', '2021-01-03 12:00:00');  
insert into foobar values('XYZ', '2021-01-03 11:00:00'); 
insert into foobar values('XYZ', '2021-01-03 10:00:00'); 
insert into foobar values('XYZ', '2021-01-03 08:00:00');  --2 hr gap
insert into foobar values('XYZ', '2021-01-03 07:00:00'); 
insert into foobar values('XYZ', '2021-01-03 06:00:00'); 
insert into foobar values('XYZ', '2021-01-03 03:00:00');  --3 hr gap
select tick, island_width, min_val, max_val, 
       min_val - lag(max_val) over (order by max_val) as gap_width from 
( 
  select tick, count(*) as island_width, 
         min(date_val) min_val, max(date_val) max_val 
  from ( 
    select t.*, 
    row_number() over ( partition by tick order by date_val ) as seqnum 
    from foobar t where tick = 'XYZ' 
    ) t 
  group by tick, date_val - seqnum * interval '1 hour' 
) t2 order by max_val desc 

Prints:

┌──────┬──────────────┬─────────────────────┬─────────────────────┬───────────┐
│ tick │ island_width │       min_val       │       max_val       │ gap_width │
├──────┼──────────────┼─────────────────────┼─────────────────────┼───────────┤
│ XYZ  │            3 │ 2021-01-03 10:00:00 │ 2021-01-03 12:00:00 │ 02:00:00  │
│ XYZ  │            3 │ 2021-01-03 06:00:00 │ 2021-01-03 08:00:00 │ 03:00:00  │
│ XYZ  │            1 │ 2021-01-03 03:00:00 │ 2021-01-03 03:00:00 │ ¤         │
└──────┴──────────────┴─────────────────────┴─────────────────────┴───────────┘
 

Upvotes: 1

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656321

In Postgres you can subtract integer from date directly:

SELECT "user", status
     , count(*) AS days_in_status, min(date) AS min, max(date) AS max
FROM  (
   SELECT "user", status, date
        , date - row_number() OVER (PARTITION BY "user", status ORDER BY date)::int AS grp_date
   FROM   tbl
   ) t
GROUP BY "user", status, grp_date
ORDER BY "user", min;

db<>fiddle here

See:

Aside: user is a reserved word, you would avoid it as actual column name.

Upvotes: 0

Gordon Linoff
Gordon Linoff

Reputation: 1269503

This is a type of gaps-and-islands problem. In this case, subtracting a sequential number from each day is probably the simplest solution for identifying the "islands":

select user, status, count(*) as num_days, min(date), max(date)
from (select t.*,
             row_number() over (partition by user, status order by date) as seqnum
      from t
     ) t
group by user, status, date - seqnum * interval '1 day'

Upvotes: 7

Related Questions