cis
cis

Reputation: 1377

Exclude overlapping periods in time aggregate function

I have a table containing each a start and and end date:

DROP TABLE temp_period;

CREATE TABLE public.temp_period
(
  id integer NOT NULL,
  "startDate" date,
  "endDate" date
);

INSERT INTO temp_period(id,"startDate","endDate") VALUES(1,'2010-01-01','2010-03-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(2,'2013-05-17','2013-07-18');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(3,'2010-02-15','2010-05-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(7,'2014-01-01','2014-12-31');
INSERT INTO temp_period(id,"startDate","endDate") VALUES(56,'2014-03-31','2014-06-30');

Now I want to know the total duration of all periods stored there. I need just the time as an interval. That's pretty easy:

SELECT sum(age("endDate","startDate")) FROM temp_period;

However, the problem is: Those periods do overlap. And I want to eliminate all overlapping periods, so that I get the total amount of time which is covered by at least one record in the table.

You see, there are quite some gaps in between the times, so passing the smallest start date and the most recent end date to the age function won't do the trick. However, I thought about doing that and subtracting the total amount of gaps, but no elegant way to do that came into my mind.

I use PostgreSQL 9.6.

Upvotes: 1

Views: 1421

Answers (5)

Gilles
Gilles

Reputation: 11

Actually there is a case that is not covered by the previous examples. What if we have such a period ?

INSERT INTO temp_period(id,"startDate","endDate") VALUES(100,'2010-01-03','2010-02-10');

We have the following intervals:

 Interval No. |                  | start_date |                |  end_date
--------------+------------------+------------+----------------+------------
            1 |  Interval start  | 2010-01-01 |  Interval end  | 2010-03-31
            2 |  Interval start  | 2010-01-03 |  Interval end  | 2010-02-10
            3 |  Interval start  | 2010-02-15 |  Interval end  | 2010-05-31
            4 |  Interval start  | 2013-05-17 |  Interval end  | 2013-07-18
            5 |  Interval start  | 2014-01-01 |  Interval end  | 2014-12-31
            6 |  Interval start  | 2014-03-31 |  Interval end  | 2014-06-30

Even though segment 3 overlaps segment 1, it's seen as a new segment, hence the (wrong) result:

 sum
-----
 620
(1 row)

The solution is to tweak the core of the query

CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END

needs to be replaced by

CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END

then it works as expected

 sum
-----
 576
(1 row)

Summary:

SELECT sum(e - s)
  FROM (
    SELECT left_edge as s, max(end_date) as e
    FROM (   
      SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
      FROM ( 
        SELECT start_date, end_date, CASE WHEN start_date < max(end_date) OVER (ORDER BY start_date, end_date rows between unbounded preceding and 1 preceding) then NULL ELSE start_date END AS new_start
        FROM temp_period
      ) s1
    ) s2
    GROUP BY left_edge
  ) s3;

Upvotes: 1

Gilles
Gilles

Reputation: 11

Beware: the answer by Laurenz Albe has a huge scalability issue.

I was more than happy when I found it. I customized it for our needs. We deployed to staging and very soon, the server took several minutes to return the results.

Then I found this answer on postgresql.org. Much more efficient. https://wiki.postgresql.org/wiki/Range_aggregation

SELECT sum(e - s)
FROM (
  SELECT left_edge as s, max(end_date) as e
  FROM (   
    SELECT start_date, end_date, max(new_start) over (ORDER BY start_date, end_date) as left_edge
    FROM ( 
      SELECT start_date, end_date, CASE WHEN start_date < lag(end_date) OVER (ORDER BY start_date, end_date) then NULL ELSE start_date END AS new_start
      FROM temp_period
      ) s1
    ) s2
  GROUP BY left_edge
  ) s3;

Result:

 sum
-----
 576
(1 row)

Upvotes: 0

Laurenz Albe
Laurenz Albe

Reputation: 246818

What about this:

WITH
   /* get all time points where something changes */
   points AS (
       SELECT "startDate" AS p
       FROM temp_period
       UNION SELECT "endDate"
       FROM temp_period
   ),
   /*
    * Get all date ranges between these time points.
    * The first time range will start with NULL,
    * but that will be excluded in the next CTE anyway.
    */
   inter AS (
      SELECT daterange(
                lag(p) OVER (ORDER BY p),
                p
             ) i
      FROM points
   ),
   /*
    * Get all date ranges that are contained
    * in at least one of the intervals.
    */
   overlap AS (
      SELECT DISTINCT i
      FROM inter
         CROSS JOIN temp_period
      WHERE i <@ daterange("startDate", "endDate")
   )
/* sum the lengths of the date ranges */
SELECT sum(age(upper(i), lower(i)))
FROM overlap;

For your data it will return:

┌──────────┐
│ interval │
├──────────┤
│ 576 days │
└──────────┘
(1 row)

Upvotes: 1

Trung Duong
Trung Duong

Reputation: 3475

You could try to use recursive cte to calculate the period. For each record, we will check if it's overlapped with previous records. If it is, we only calculate the period that is not overlapping.

WITH RECURSIVE days_count AS 
  ( 
         SELECT startDate, 
                endDate, 
                AGE(endDate, startDate) AS total_days, 
                rowSeq 
         FROM   ordered_data 
         WHERE  rowSeq = 1 
         UNION ALL 
         SELECT     GREATEST(curr.startDate, prev.endDate)                                            AS startDate,
                    GREATEST(curr.endDate, prev.endDate)                                              AS endDate,
                    AGE(GREATEST(curr.endDate, prev.endDate), GREATEST(curr.startDate, prev.endDate)) AS total_days,
                    curr.rowSeq 
         FROM       ordered_data curr 
         INNER JOIN days_count prev 
         ON         curr.rowSeq > 1 
         AND        curr.rowSeq = prev.rowSeq + 1), 
ordered_data AS 
  ( 
           SELECT   *, 
                    ROW_NUMBER() OVER (ORDER BY startDate) AS rowSeq 
           FROM     temp_period) 
SELECT SUM(total_days) AS total_days
FROM   days_count;

I've created a demo here

Upvotes: 1

Ross Bush
Ross Bush

Reputation: 15175

This one required two outer joins on a complex query. One join to identify all overlaps with a startdate larger than THIS and to expand the timespan to match the larger of the two. The second join is needed to match records with no overlaps. Take the Min of the min and the max of the max, including non matched. I was using MSSQL so the syntax may be a bit different.

DECLARE @temp_period TABLE
(
  id int NOT NULL,
  startDate datetime,
  endDate datetime
)

INSERT INTO @temp_period(id,startDate,endDate) VALUES(1,'2010-01-01','2010-03-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(2,'2013-05-17','2013-07-18')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-05-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(3,'2010-02-15','2010-07-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(7,'2014-01-01','2014-12-31')
INSERT INTO @temp_period(id,startDate,endDate) VALUES(56,'2014-03-31','2014-06-30')


;WITH OverLaps AS
(
    SELECT 
        Main.id,
        OverlappedID=Overlaps.id,
        OverlapMinDate,
        OverlapMaxDate
    FROM
        @temp_period Main
        LEFT OUTER JOIN
        (
            SELECT 
                This.id,
                OverlapMinDate=CASE WHEN This.StartDate<Prior.StartDate THEN This.StartDate ELSE Prior.StartDate END,
                OverlapMaxDate=CASE WHEN This.EndDate>Prior.EndDate THEN This.EndDate ELSE Prior.EndDate END,
                PriorID=Prior.id
            FROM
                @temp_period This
                LEFT OUTER JOIN @temp_period Prior ON Prior.endDate > This.startDate AND Prior.startdate < this.endDate AND This.Id<>Prior.ID
        ) Overlaps ON Main.Id=Overlaps.PriorId
)

SELECT
    T.Id,
    --If has overlapped then sum all overlapped records prior to this one, else not and overlap get the start and end
    MinDate=MIN(COALESCE(HasOverlapped.OverlapMinDate,startDate)),
    MaxDate=MAX(COALESCE(HasOverlapped.OverlapMaxDate,endDate))
FROM
    @temp_period T
    LEFT OUTER JOIN OverLaps IsAOverlap ON IsAOverlap.OverlappedID=T.id
    LEFT OUTER JOIN OverLaps HasOverlapped ON HasOverlapped.Id=T.id
WHERE
    IsAOverlap.OverlappedID IS NULL -- Exclude older records that have overlaps
GROUP BY
    T.Id

Upvotes: 0

Related Questions