Reputation: 130
I have a list of dates and IDs, and I would like to roll them up into periods of consecutive dates, within each ID.
For a table called "data" with the columns "testid" and "pulldate":
| A79 | 2010-06-02 |
| A79 | 2010-06-03 |
| A79 | 2010-06-04 |
| B72 | 2010-04-22 |
| B72 | 2010-06-03 |
| B72 | 2010-06-04 |
| C94 | 2010-04-09 |
| C94 | 2010-04-10 |
| C94 | 2010-04-11 |
| C94 | 2010-04-12 |
| C94 | 2010-04-13 |
| C94 | 2010-04-14 |
| C94 | 2010-06-02 |
| C94 | 2010-06-03 |
| C94 | 2010-06-04 |
I want to generate a table with the columns "testid", "group", "start_date", "end_date":
| A79 | 1 | 2010-06-02 | 2010-06-04 |
| B72 | 2 | 2010-04-22 | 2010-04-22 |
| B72 | 3 | 2010-06-03 | 2010-06-04 |
| C94 | 4 | 2010-04-09 | 2010-04-14 |
| C94 | 5 | 2010-06-02 | 2010-06-04 |
This is the code I came up with:
SELECT t2.testid,
       t2."group",
       MIN(t2.pulldate) AS start_date,
       MAX(t2.pulldate) AS end_date
FROM (SELECT t1.pulldate,
             t1.testid,
             SUM(t1."check") OVER (ORDER BY t1.testid, t1.pulldate) AS "group"
      FROM (SELECT data.pulldate,
                   data.testid,
                   CASE
                     WHEN data.testid = LAG(data.testid, 1)
                            OVER (ORDER BY data.testid, data.pulldate)
                      AND data.pulldate = date(LAG(data.pulldate, 1)
                            OVER (PARTITION BY data.testid
                                  ORDER BY data.pulldate)) + integer '1'
                     THEN 0
                     ELSE 1
                   END AS "check"
            FROM data
            ORDER BY data.testid, data.pulldate) AS t1) AS t2
GROUP BY t2.testid, t2."group"
ORDER BY t2."group";
I used the LAG window function to compare each row to the previous one, emitting a 1 whenever a new group should start; a running sum of that column then yields the group number, and finally I aggregate over each combination of "testid" and "group".
Is there a better way to accomplish my goal, or does this operation have a name?
I am using PostgreSQL 8.4.
Upvotes: 2
Views: 556
Reputation: 75986
Here's another approach:
WITH TEMP_TAB AS (
    SELECT testid, pulldate,
           (pulldate + (row_number || ' days')::interval)::date AS dummydate
    FROM (SELECT *, row_number() OVER ()
          FROM (SELECT * FROM data ORDER BY testid, pulldate DESC) AS tab1
         ) AS tab2
)
SELECT * FROM (
    SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
    FROM TEMP_TAB
    GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate;
Warning: this strategy breaks if there are repeated (testid, pulldate)
pairs. In this case, one should first do a DISTINCT over those fields.
Explanation: the intermediate table has a dummydate column, obtained by adding to pulldate a number of days equal to the row number (in the ordered select); its only meaning is that rows with the same dummydate belong to the same run of consecutive dates. E.g., intermediate results:
test=# SELECT *, row_number() OVER () FROM
test-# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1;
testid | pulldate | row_number
--------+------------+------------
A79 | 2010-06-04 | 1
A79 | 2010-06-03 | 2
A79 | 2010-06-02 | 3
B72 | 2010-06-04 | 4
B72 | 2010-06-03 | 5
B72 | 2010-04-22 | 6
C94 | 2010-06-04 | 7
C94 | 2010-06-03 | 8
C94 | 2010-06-02 | 9
C94 | 2010-04-14 | 10
C94 | 2010-04-13 | 11
C94 | 2010-04-12 | 12
C94 | 2010-04-11 | 13
C94 | 2010-04-10 | 14
C94 | 2010-04-09 | 15
test=# SELECT
test-# testid,pulldate,(pulldate + (row_number || ' days')::interval)::date AS dummydate
test-# FROM ( SELECT *, row_number() OVER () FROM
test(# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )
test-# AS tab2;
testid | pulldate | dummydate
--------+------------+------------
A79 | 2010-06-04 | 2010-06-05
A79 | 2010-06-03 | 2010-06-05
A79 | 2010-06-02 | 2010-06-05
B72 | 2010-06-04 | 2010-06-08
B72 | 2010-06-03 | 2010-06-08
B72 | 2010-04-22 | 2010-04-28
C94 | 2010-06-04 | 2010-06-11
C94 | 2010-06-03 | 2010-06-11
C94 | 2010-06-02 | 2010-06-11
C94 | 2010-04-14 | 2010-04-24
C94 | 2010-04-13 | 2010-04-24
C94 | 2010-04-12 | 2010-04-24
C94 | 2010-04-11 | 2010-04-24
C94 | 2010-04-10 | 2010-04-24
C94 | 2010-04-09 | 2010-04-24
Edit: the WITH is not necessary here (though I like it nevertheless); this is equivalent:
SELECT * FROM (
    SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
    FROM (SELECT testid, pulldate,
                 (pulldate + (row_number || ' days')::interval)::date AS dummydate
          FROM (SELECT *, row_number() OVER ()
                FROM (SELECT * FROM data ORDER BY testid, pulldate DESC) AS tab1
               ) AS tab2
         ) AS temp_tab
    GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate;
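An equivalent formulation of the same idea (this family of problems is sometimes called "gaps and islands") subtracts a per-testid row number instead of adding one from a global ordering, so no outer DESC sort is needed and ties across testids cannot interfere. A sketch only, not tested on 8.4; a running group number could still be layered on with another window function if needed:

```sql
-- Within one run of consecutive dates, pulldate minus its per-testid
-- row number is constant; group on that "anchor" value.
SELECT testid,
       MIN(pulldate) AS start_date,
       MAX(pulldate) AS end_date
FROM (SELECT testid,
             pulldate,
             pulldate - (ROW_NUMBER() OVER
                 (PARTITION BY testid ORDER BY pulldate))::integer AS anchor
      FROM data) AS t
GROUP BY testid, anchor
ORDER BY testid, start_date;
```

Same caveat as above: repeated (testid, pulldate) pairs break the arithmetic, so apply DISTINCT first if duplicates are possible.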
Upvotes: 1
Reputation: 133692
I don't know of a standard name for this technique. I tried writing it myself and came up with something essentially equivalent to yours, differing only in having one fewer WindowAgg.
select testid, group_num as "group",
       min(pulldate) as start_date,
       max(pulldate) as end_date
from (select testid,
             pulldate,
             sum(case when projected_pulldate is null
                        or pulldate <> projected_pulldate
                      then 1 else 0 end)
               over (order by testid, pulldate) as group_num
      from (select testid, pulldate,
                   lag(pulldate, 1) over (partition by testid
                                          order by pulldate) + 1
                     as projected_pulldate
            from data) x
     ) grouped
group by testid, group_num
order by 1, 2;
This is hardly pretty, and I do wonder whether this is simply a case where plpgsql or similar is a better fit.
create or replace function data_extents()
returns table(testid char(3), "group" int, start_date date, end_date date)
language plpgsql
stable as $$
declare
rec data%rowtype;
begin
"group" := 1;
for rec in select * from data order by testid, pulldate loop
if testid is null then
-- first row
testid := rec.testid;
start_date := rec.pulldate;
end_date := rec.pulldate;
elsif rec.testid <> testid or rec.pulldate <> (end_date + 1) then
-- discontinuity
return next;
testid := rec.testid;
start_date := rec.pulldate;
end_date := rec.pulldate;
"group" := "group" + 1;
else
end_date := end_date + 1;
end if;
end loop;
if testid is not null then
return next;
end if;
end;
$$;
This is hardly pretty either... although it does, in principle, derive the output from a single scan without several separate aggregations, which at least feels better. It takes about the same time on the tiny dataset; I haven't tried a larger one yet, to be honest.
Since neither of our solutions allows predicates such as "testid = XXX" to be pushed down into the scan on data (as far as I can tell), a function may be the only way to filter efficiently?
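That said, when the predicate is already known at the time the query is written, one workaround is to push it into the innermost subquery by hand, so the window functions only ever see the filtered rows and an index on (testid, pulldate) can serve the scan. A sketch of the innermost level of the query above only; the 'C94' literal is purely illustrative:

```sql
-- Filter applied before any window function runs, so LAG's partition
-- contains only the rows of interest ('C94' is a hypothetical value).
select testid, pulldate,
       lag(pulldate, 1) over (partition by testid
                              order by pulldate) + 1 as projected_pulldate
from data
where testid = 'C94';
```

This obviously doesn't help when the filter arrives later, e.g. applied to a view over the grouped result, which is where the function approach keeps its appeal.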
Upvotes: 1