Reputation: 8572
Using PostgreSQL 9.0.
Let's say I have a table containing the fields: company
, profession
and year
. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:
Example Table:
+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google | Programmer | 2000 |
| Google | Sales | 2000 |
| Google | Sales | 2001 |
| Google | Sales | 2002 |
| Google | Sales | 2004 |
| Mozilla | Sales | 2002 |
+-----------------------------+
I'm interested in a query which would output rows similar to the following:
+-----------------------------------------+
| company | profession | year |
+---------+------------+------------------+
| Google | Programmer | [2000] |
| Google | Sales | [2000,2001,2002] |
| Google | Sales | [2004] |
| Mozilla | Sales | [2002] |
+-----------------------------------------+
The essential feature is that only consecutive years shall be grouped together.
Upvotes: 16
Views: 8004
Reputation: 656714
The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR
loop) and should be substantially faster in this particular case:
Test table:
CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
('Google', 'Programmer', 2000)
, ('Google', 'Sales', 2000)
, ('Google', 'Sales', 2001)
, ('Google', 'Sales', 2002)
, ('Google', 'Sales', 2004)
, ('Mozilla', 'Sales', 2002)
;
Function:
CREATE OR REPLACE FUNCTION f_periods()
RETURNS TABLE (company text, profession text, years int[])
LANGUAGE plpgsql AS
$func$
DECLARE
r tbl; -- use table type as row variable
r0 tbl;
BEGIN
FOR r IN
SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
LOOP
IF ( r.company, r.profession, r.year)
<> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row
RETURN QUERY
SELECT r0.company, r0.profession, years; -- output row
years := ARRAY[r.year]; -- start new array
ELSE
years := years || r.year; -- add to array - year can be NULL, too
END IF;
r0 := r; -- remember last row
END LOOP;
RETURN QUERY -- output last iteration
SELECT r0.company, r0.profession, years;
END
$func$;
Call:
SELECT * FROM f_periods();
db<>fiddle here
Produces the requested result.
Upvotes: 5
Reputation: 77677
There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.
And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company
and profession
, now you only need something to distinguish years that belong to different sequences.
That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:
WITH MarkedForGrouping AS (
SELECT
company,
profession,
year,
year - ROW_NUMBER() OVER (
PARTITION BY company, profession
ORDER BY year
) AS seqID
FROM atable
)
SELECT
company,
profession,
array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
company,
profession,
seqID
Upvotes: 22
Reputation:
Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).
The first step is to identify non-consecutive values for the year:
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
This returns the following result:
company | profession | year | group_cnt ---------+------------+------+----------- Google | Programmer | 2000 | 1 Google | Sales | 2000 | 1 Google | Sales | 2001 | 0 Google | Sales | 2002 | 0 Google | Sales | 2004 | 1 Mozilla | Sales | 2002 | 1
Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
This returns the following result:
company | profession | year | group_nr ---------+------------+------+---------- Google | Programmer | 2000 | 1 Google | Sales | 2000 | 2 Google | Sales | 2001 | 2 Google | Sales | 2002 | 2 Google | Sales | 2004 | 3 Mozilla | Sales | 2002 | 4 (6 rows)
As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:
select company,
profession,
array_agg(year) as years
from (
select company,
profession,
year,
sum(group_cnt) over (order by company, profession, year) as group_nr
from (
select company,
profession,
year,
case
when row_number() over (partition by company, profession order by year) = 1 or
year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
else 0
end as group_cnt
from qualification
) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
company | profession | years ---------+------------+------------------ Google | Programmer | {2000} Google | Sales | {2000,2001,2002} Google | Sales | {2004} Mozilla | Sales | {2002} (4 rows)
Which is exactly what you wanted, if I'm not mistaken.
Upvotes: 26