sqlpostgresqlplpgsqlwindow-functionsgaps-and-islands

Reputation: 8572

GROUP BY and aggregate sequential numeric values

Using PostgreSQL 9.0.

Let's say I have a table containing the fields: company, profession and year. I want to return a result which contains unique companies and professions, but aggregates (into an array is fine) years based on numeric sequence:

Example Table:

+-----------------------------+
| company | profession | year |
+---------+------------+------+
| Google  | Programmer | 2000 |
| Google  | Sales      | 2000 |
| Google  | Sales      | 2001 |
| Google  | Sales      | 2002 |
| Google  | Sales      | 2004 |
| Mozilla | Sales      | 2002 |
+-----------------------------+

I'm interested in a query which would output rows similar to the following:

+-----------------------------------------+
| company | profession | year             |
+---------+------------+------------------+
| Google  | Programmer | [2000]           |
| Google  | Sales      | [2000,2001,2002] |
| Google  | Sales      | [2004]           |
| Mozilla | Sales      | [2002]           |
+-----------------------------------------+

The essential feature is that only consecutive years shall be grouped together.

Upvotes: 16

Answers (3)

Erwin Brandstetter

Reputation: 656714

Procedural solution with PL/pgSQL

The problem is rather unwieldy for plain SQL with aggregate / windows functions. While looping is typically slower than set-based solutions with plain SQL, a procedural solution with PL/pgSQL can make do with a single sequential scan over the table (implicit cursor of a FOR loop) and should be substantially faster in this particular case:

Test table:

CREATE TEMP TABLE tbl (company text, profession text, year int);
INSERT INTO tbl VALUES
  ('Google',  'Programmer', 2000)
, ('Google',  'Sales',      2000)
, ('Google',  'Sales',      2001)
, ('Google',  'Sales',      2002)
, ('Google',  'Sales',      2004)
, ('Mozilla', 'Sales',      2002)
;

Function:

CREATE OR REPLACE FUNCTION f_periods()
  RETURNS TABLE (company text, profession text, years int[])
  LANGUAGE plpgsql AS
$func$
DECLARE
   r  tbl; -- use table type as row variable
   r0 tbl;
BEGIN
   FOR r IN
      SELECT * FROM tbl t ORDER BY t.company, t.profession, t.year
   LOOP
      IF ( r.company,  r.profession,  r.year)
      <> (r0.company, r0.profession, r0.year + 1) THEN -- not true for first row

         RETURN QUERY
         SELECT r0.company, r0.profession, years; -- output row

         years := ARRAY[r.year];     -- start new array
      ELSE
         years := years || r.year;   -- add to array - year can be NULL, too
      END IF;

      r0 := r;                       -- remember last row
   END LOOP;

   RETURN QUERY                      -- output last iteration
   SELECT r0.company, r0.profession, years;
END
$func$;

Call:

SELECT * FROM f_periods();

db<>fiddle here

Produces the requested result.

Upvotes: 5

Andriy M

Reputation: 77677

There's much value to @a_horse_with_no_name's answer, both as a correct solution and, like I already said in a comment, as a good material for learning how to use different kinds of window functions in PostgreSQL.

And yet I cannot help feeling that the approach taken in that answer is a bit too much of an effort for a problem like this one. Basically, what you need is an additional criterion for grouping before you go on aggregating years in arrays. You've already got company and profession, now you only need something to distinguish years that belong to different sequences.

That is just what the above mentioned answer provides and that is precisely what I think can be done in a simpler way. Here's how:

WITH MarkedForGrouping AS (
  SELECT
    company,
    profession,
    year,
    year - ROW_NUMBER() OVER (
      PARTITION BY company, profession
      ORDER BY year
    ) AS seqID
  FROM atable
)
SELECT
  company,
  profession,
  array_agg(year) AS years
FROM MarkedForGrouping
GROUP BY
  company,
  profession,
  seqID

Upvotes: 22

user330315

Reputation:

Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I cannot come up with a better solution).

The first step is to identify non-consecutive values for the year:

Step 1) Identify non-consecutive values

select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification

This returns the following result:

 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1

Now with the group_cnt value we can create "group IDs" for each group that has consecutive years:

Step 2) Define group IDs

select company,
   profession,
   year,
   sum(group_cnt) over (order by company, profession, year) as group_nr
from ( 
select company, 
       profession,
       year,
       case 
          when row_number() over (partition by company, profession order by year) = 1 or 
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
) t1

This returns the following result:

 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)

As you can see each "group" got its own group_nr and this we can finally use to aggregate over by adding yet another derived table:

Step 3) Final query

select company,
       profession,
       array_agg(year) as years
from (
  select company,
       profession,
       year,
       sum(group_cnt) over (order by company, profession, year) as group_nr
  from ( 
    select company, 
           profession,
           year,
           case 
              when row_number() over (partition by company, profession order by year) = 1 or 
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
  ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr

This returns the following result:

 company | profession |      years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)

Which is exactly what you wanted, if I'm not mistaken.

Upvotes: 26

GROUP BY and aggregate sequential numeric values

Answers (3)

Procedural solution with PL/pgSQL

Step 1) Identify non-consecutive values

Step 2) Define group IDs

Step 3) Final query

Related Questions