Reputation: 130
I have a list of dates and IDs, and I would like to roll them up into periods of consecutive dates, within each ID.
For a table called "data" with the columns "testid" and "pulldate":
| A79 | 2010-06-02 |
| A79 | 2010-06-03 |
| A79 | 2010-06-04 |
| B72 | 2010-04-22 |
| B72 | 2010-06-03 |
| B72 | 2010-06-04 |
| C94 | 2010-04-09 |
| C94 | 2010-04-10 |
| C94 | 2010-04-11 |
| C94 | 2010-04-12 |
| C94 | 2010-04-13 |
| C94 | 2010-04-14 |
| C94 | 2010-06-02 |
| C94 | 2010-06-03 |
| C94 | 2010-06-04 |
I want to generate a table with the columns "testid", "group", "start_date", "end_date":
| A79 | 1 | 2010-06-02 | 2010-06-04 |
| B72 | 2 | 2010-04-22 | 2010-04-22 |
| B72 | 3 | 2010-06-03 | 2010-06-04 |
| C94 | 4 | 2010-04-09 | 2010-04-14 |
| C94 | 5 | 2010-06-02 | 2010-06-04 |
This is the code I came up with:
SELECT t2.testid,
       t2."group",
       MIN(t2.pulldate) AS start_date,
       MAX(t2.pulldate) AS end_date
FROM (SELECT t1.pulldate,
             t1.testid,
             SUM(t1."check") OVER (ORDER BY t1.testid, t1.pulldate) AS "group"
      FROM (SELECT data.pulldate,
                   data.testid,
                   CASE
                     WHEN data.testid = LAG(data.testid, 1)
                            OVER (ORDER BY data.testid, data.pulldate)
                      AND data.pulldate = date(LAG(data.pulldate, 1)
                            OVER (PARTITION BY data.testid
                                  ORDER BY data.pulldate)) + integer '1'
                     THEN 0
                     ELSE 1
                   END AS "check"
            FROM data
            ORDER BY data.testid, data.pulldate) AS t1) AS t2
GROUP BY t2.testid, t2."group"
ORDER BY t2."group";
I used the LAG window function to compare each row to the previous one, emitting a 1 whenever a new group should start; a running sum of that column then yields the group number, and finally I aggregate over each combination of "testid" and "group".
Is there a better way to accomplish my goal, or does this operation have a name?
I am using PostgreSQL 8.4.
Upvotes: 2
Views: 556
Reputation: 75986
Here's another approach:
WITH TEMP_TAB AS (
    SELECT testid, pulldate,
           (pulldate + (row_number || ' days')::interval)::date AS dummydate
    FROM (SELECT *, row_number() OVER ()
          FROM (SELECT * FROM data ORDER BY testid, pulldate DESC) AS tab1
         ) AS tab2
)
SELECT * FROM (
    SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
    FROM TEMP_TAB
    GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate;
Warning: this strategy breaks if there are repeated (testid, pulldate)
pairs. In this case, one should first do a DISTINCT over those fields.
Explanation: the intermediate table has a dummydate column, obtained by adding to pulldate a number of days equal to the row number (in the ordered select); its only meaning is that rows with the same dummydate belong to the same run of consecutive dates. E.g., intermediate results:
test=# SELECT *, row_number() OVER () FROM
test-# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1;
testid | pulldate | row_number
--------+------------+------------
A79 | 2010-06-04 | 1
A79 | 2010-06-03 | 2
A79 | 2010-06-02 | 3
B72 | 2010-06-04 | 4
B72 | 2010-06-03 | 5
B72 | 2010-04-22 | 6
C94 | 2010-06-04 | 7
C94 | 2010-06-03 | 8
C94 | 2010-06-02 | 9
C94 | 2010-04-14 | 10
C94 | 2010-04-13 | 11
C94 | 2010-04-12 | 12
C94 | 2010-04-11 | 13
C94 | 2010-04-10 | 14
C94 | 2010-04-09 | 15
test=# SELECT
test-# testid,pulldate,(pulldate + (row_number || ' days')::interval)::date AS dummydate
test-# FROM ( SELECT *, row_number() OVER () FROM
test(# ( SELECT * FROM data ORDER BY testid,pulldate DESC) AS tab1 )
test-# AS tab2;
testid | pulldate | dummydate
--------+------------+------------
A79 | 2010-06-04 | 2010-06-05
A79 | 2010-06-03 | 2010-06-05
A79 | 2010-06-02 | 2010-06-05
B72 | 2010-06-04 | 2010-06-08
B72 | 2010-06-03 | 2010-06-08
B72 | 2010-04-22 | 2010-04-28
C94 | 2010-06-04 | 2010-06-11
C94 | 2010-06-03 | 2010-06-11
C94 | 2010-06-02 | 2010-06-11
C94 | 2010-04-14 | 2010-04-24
C94 | 2010-04-13 | 2010-04-24
C94 | 2010-04-12 | 2010-04-24
C94 | 2010-04-11 | 2010-04-24
C94 | 2010-04-10 | 2010-04-24
C94 | 2010-04-09 | 2010-04-24
Edit: the WITH is not necessary here (though I like it nevertheless); this is equivalent:
SELECT * FROM (
    SELECT testid, min(pulldate) AS mindate, max(pulldate) AS maxdate
    FROM (SELECT testid, pulldate,
                 (pulldate + (row_number || ' days')::interval)::date AS dummydate
          FROM (SELECT *, row_number() OVER ()
                FROM (SELECT * FROM data ORDER BY testid, pulldate DESC) AS tab1
               ) AS tab2
         ) AS temp_tab
    GROUP BY testid, dummydate
) AS tab3
ORDER BY testid, mindate;
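An equivalent formulation of the same idea (this family of problems is sometimes called "gaps and islands") subtracts a per-testid row number instead of adding one from a global ordering, so no outer DESC sort is needed and ties across testids cannot interfere. A sketch only, not tested on 8.4; a running group number could still be layered on with another window function if needed:

```sql
-- Within one run of consecutive dates, pulldate minus its per-testid
-- row number is constant; group on that "anchor" value.
SELECT testid,
       MIN(pulldate) AS start_date,
       MAX(pulldate) AS end_date
FROM (SELECT testid,
             pulldate,
             pulldate - (ROW_NUMBER() OVER
                 (PARTITION BY testid ORDER BY pulldate))::integer AS anchor
      FROM data) AS t
GROUP BY testid, anchor
ORDER BY testid, start_date;
```

Same caveat as above: repeated (testid, pulldate) pairs break the arithmetic, so apply DISTINCT first if duplicates are possible.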
Upvotes: 1
Reputation: 133692
I don't know of a standard name for this technique. I tried writing it myself and came up with something essentially equivalent to yours, differing only in having one fewer WindowAgg.
select testid, group_num as "group",
       min(pulldate) as start_date,
       max(pulldate) as end_date
from (select testid,
             pulldate,
             sum(case when projected_pulldate is null
                        or pulldate <> projected_pulldate
                      then 1 else 0 end)
               over (order by testid, pulldate) as group_num
      from (select testid, pulldate,
                   lag(pulldate, 1) over (partition by testid
                                          order by pulldate) + 1
                     as projected_pulldate
            from data) x
     ) grouped
group by testid, group_num
order by 1, 2;
This is hardly pretty, and I do wonder whether this is simply a case where plpgsql or similar is a better fit.
create or replace function data_extents()
returns table(testid char(3), "group" int, start_date date, end_date date)
language plpgsql
stable as $$
declare
rec data%rowtype;
begin
"group" := 1;
for rec in select * from data order by testid, pulldate loop
if testid is null then
-- first row
testid := rec.testid;
start_date := rec.pulldate;
end_date := rec.pulldate;
elsif rec.testid <> testid or rec.pulldate <> (end_date + 1) then
-- discontinuity
return next;
testid := rec.testid;
start_date := rec.pulldate;
end_date := rec.pulldate;
"group" := "group" + 1;
else
end_date := end_date + 1;
end if;
end loop;
if testid is not null then
return next;
end if;
end;
$$;
This is hardly pretty either... although it does, in principle, derive the output from a single scan without several separate aggregations, which at least feels better. It takes about the same time on the tiny dataset; I haven't tried a larger one yet, to be honest.
Since neither of our solutions allows predicates such as "testid = XXX" to be pushed down into the scan on data (as far as I can tell), a function may be the only way to filter efficiently?
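That said, when the predicate is already known at the time the query is written, one workaround is to push it into the innermost subquery by hand, so the window functions only ever see the filtered rows and an index on (testid, pulldate) can serve the scan. A sketch of the innermost level of the query above only; the 'C94' literal is purely illustrative:

```sql
-- Filter applied before any window function runs, so LAG's partition
-- contains only the rows of interest ('C94' is a hypothetical value).
select testid, pulldate,
       lag(pulldate, 1) over (partition by testid
                              order by pulldate) + 1 as projected_pulldate
from data
where testid = 'C94';
```

This obviously doesn't help when the filter arrives later, e.g. applied to a view over the grouped result, which is where the function approach keeps its appeal.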
Upvotes: 1