Pirastrino
Pirastrino

Reputation: 190

How to find overlapping date ranges in Postgresql between multiple rows?

I have a table of employees which contains information of their position, team, team leader etc. Every row has an valid_from and valid_to field that make a specific date range with other specification. If any information changed it will automatically creates a new row with valid_from = now() and set previous row to valid_to = now() - interval '1 day'.

The problem is that this automated row generation can creates overlapping date ranges as it is generated from the different spreadsheets and I need to manually correct them. So I'd like to write an sql query that will find these overlaps.

I made a view from the table:

select
  employee_id,
  rank() over (partition by employee_id order by valid_from) as rank,
  valid_from,
  valid_to
from myTable

which gives me data as below:

| employee_id | rank | valid_from | valid_to   |
|-------------|------|------------|------------|
| 1000        | 1    | 2016-11-28 | 2017-06-30 |
| 1000        | 2    | 2017-07-01 | 2018-02-26 |
| 1000        | 3    | 2018-02-25 | 2018-03-12 |
| 1001        | 1    | 2016-11-28 | 2017-07-30 |
| 1001        | 2    | 2017-07-31 | 2017-07-31 |
| 1001        | 3    | 2017-08-01 | 2017-08-01 |

and I'd like to add one more column with a true/false value or something similar if current row date range overlaps any other row within the same employee_id:

| employee_id | rank | valid_from | valid_to   | overlapp_exist |
|-------------|------|------------|------------|----------------|
| 1000        | 1    | 2016-11-28 | 2017-06-30 | false          |
| 1000        | 2    | 2017-07-01 | 2018-02-26 | true           |
| 1000        | 3    | 2018-02-25 | 2018-03-12 | true           |
| 1001        | 1    | 2016-11-28 | 2017-07-30 | false          |
| 1001        | 2    | 2017-07-31 | 2017-07-31 | false          |
| 1001        | 3    | 2017-08-01 | 2017-08-01 | false          |

The table contains about 50k rows. The query will be triggered from the frontend on demand so it should be optimised to be able to send back data as quickly as possible. Any help appreciated!

Upvotes: 3

Views: 3627

Answers (3)

Gordon Linoff
Gordon Linoff

Reputation: 1271171

You can use exists:

select t.*,
       (exists (select 1
                from mytable t2
                where t2.employee_id = t.employee_id and
                      t2.valid_from < t.valid_to and
                      t2.valid_to > t.valid_from and
                      (t2.valid_from <> t.valid_from or
                       t2.valid_to <> t.valid_to
                      )
                )
        ) as overlaps_flag
from mytable t;

Note: This will be confused by exact duplicates. If you have a unique identifier for each row (which I strongly recommend), then use that column for the last two conditions.

You can use this subquery in an update query, if your intention is to add a column to the data itself.

EDIT

Another method that probably has better performance is to use window functions:

select t.*,
       (max(valid_to) over (partition by employee_id over order by valid_from rows between unbounded preceding and 1 preceding) > valid_to
       ) as overlaps_flag
from mytable t;

I'm not sure if that should be > or >=.

And to handle starts on the same date (and not return null):

select t.*,
       ( (max(valid_to) over (partition by employee_id over order by valid_from rows between unbounded preceding and 1 preceding) > valid_to
         ) or
         count(*) over (partition by employee_id, valid_from) >= 2
       ) as overlaps_flag
from mytable t;

Upvotes: 1

clemens
clemens

Reputation: 17722

If you want generally avoid overlapping dates, a constraint might be a good solution:

ALTER TABLE my_table ADD CONSTRAINT my_table_date_range_check EXCLUDE USING GIST (
        employee_id WITH =,
        DATERANGE(valid_from, valid_to) WITH &&
);

Note: The date ranges in this definition are half-open.

Upvotes: 2

Laurenz Albe
Laurenz Albe

Reputation: 248235

It won't be fast, as every row has to be matched with every other row:

SELECT a.*, b.*
FROM mytable AS a
   JOIN mytable AS b
      ON daterange(a.valid_from, a.valid_to) && daterange(b.valid_from, b.valid_to)
WHERE (a.valid_from, a.valid_to) <= (b.valid_from, b.valid_to);

It might be better to have an exclusion constraint on the table that prevents such data from being added in the first place.

Upvotes: 4

Related Questions