mtryingtocode
mtryingtocode

Reputation: 969

Grouping consecutive dates in PostgreSQL

I have two tables which I need to combine as sometimes some dates are found in table A and not in table B and vice versa. My desired result is that for those overlaps on consecutive days be combined.

I'm using PostgreSQL.

Table A

id  startdate   enddate
--------------------------
101 12/28/2013  12/31/2013

Table B

id  startdate   enddate
--------------------------
101 12/15/2013  12/15/2013
101 12/16/2013  12/16/2013
101 12/28/2013  12/28/2013
101 12/29/2013  12/31/2013

Desired Result

id  startdate   enddate
-------------------------
101 12/15/2013  12/16/2013
101 12/28/2013  12/31/2013

Upvotes: 3

Views: 1209

Answers (2)

wildplasser
wildplasser

Reputation: 44240

The below fragment does what you intend. (but it will probably be very slow) The problem is that detecteng (non)overlapping dateranges is impossible with standard range operators, since a range could be split into two parts. So, my code does the following:

  • split the dateranges from table_A into atomic records, with one date per record
  • [the same for table_b]
  • cross join these two tables (we are only interested in A_not_in_B, and B_not_in_A) , remembering which of the L/R outer join wings it came from.
  • re-aggregate the resulting records into date ranges.

-- EXPLAIN ANALYZE
-- 
WITH  RECURSIVE ranges AS (
            -- Chop up the a-table into atomic date units
    WITH ar AS (
            SELECT generate_series(a.startdate,a.enddate , '1day'::interval)::date  AS thedate
            ,  'A'::text AS which
            , a.id
            FROM a
            )
            -- Same for the b-table
    , br AS (
            SELECT generate_series(b.startdate,b.enddate, '1day'::interval)::date  AS thedate
            ,  'B'::text AS which
            , b.id
            FROM b
            )
            -- combine the two sets, retaining a_not_in_b plus b_not_in_a
    , moments AS (
            SELECT COALESCE(ar.id,br.id) AS id
            , COALESCE(ar.which, br.which) AS which
            , COALESCE(ar.thedate, br.thedate) AS thedate
            FROM ar
            FULL JOIN br ON br.id = ar.id AND br.thedate =  ar.thedate
            WHERE ar.id IS NULL OR br.id IS NULL
            )
            -- use a recursive CTE to re-aggregate the atomic moments into ranges
    SELECT m0.id, m0.which
            , m0.thedate AS startdate
            , m0.thedate AS enddate
    FROM moments m0
    WHERE NOT EXISTS ( SELECT * FROM moments nx WHERE nx.id = m0.id  AND nx.which = m0.which
            AND nx.thedate = m0.thedate -1
            )
    UNION ALL
    SELECT rr.id, rr.which
            , rr.startdate AS startdate
            , m1.thedate AS enddate
    FROM ranges rr
    JOIN moments m1 ON m1.id = rr.id AND m1.which = rr.which AND m1.thedate = rr.enddate +1
    )
SELECT * FROM ranges ra
WHERE NOT EXISTS (SELECT * FROM ranges nx
    -- suppress partial subassemblies
    WHERE nx.id = ra.id AND nx.which = ra.which
    AND nx.startdate = ra.startdate
    AND nx.enddate > ra.enddate
    )
 ;

Upvotes: 2

mlinth
mlinth

Reputation: 3108

Right. I have a query that I think works. It certainly works on the sample records you provided. It uses a recursive CTE.

First, you need to merge the two tables. Next, use a recursive CTE to get the sequences of overlapping dates. Finally, get the start and end dates, and join back to the "merged" table to get the id.

with recursive allrecords as -- this merges the input tables. Add a unique row identifier
(
    select *, row_number() over (ORDER BY startdate) as rowid from
    (select * from table1
    UNION
    select * from table2) a
),
 path as ( -- the recursive CTE. This gets the sequences
    select rowid as parent,rowid,startdate,enddate from allrecords a
    union 
    select p.parent,b.rowid,b.startdate,b.enddate from  allrecords b  join path p on (p.enddate + interval '1 day')>=b.startdate and p.startdate <= b.startdate
)


SELECT id,g.startdate,g.enddate FROM -- outer query to get the id

    -- inner query to get the start and end of each sequence
    (select parent,min(startdate) as startdate, max(enddate) as enddate from
    (
        select *, row_number() OVER (partition by rowid order by parent,startdate) as row_number from path
    ) a
    where row_number = 1 -- We only want the first occurrence of each record
    group by parent)g
INNER JOIN allrecords a on a.rowid = parent

Upvotes: 2

Related Questions