Rannie Xue
Rannie Xue

Reputation: 33

How to number consecutive records per island?

I have a table which looks like:

group    date        color
  A      1-1-2019      R
  A      1-2-2019      Y
  B      1-1-2019      R
  B      1-2-2019      Y
  B      1-3-2019      Y
  B      1-4-2019      R
  B      1-5-2019      R
  B      1-6-2019      R

And it's ordered by group and date. I want an extra column showing sequential number of consecutive color 'R' for each group.

Required output:

group    date        color    rank
  A      1-1-2019      R      1
  A      1-2-2019      Y      null
  B      1-1-2019      R      1
  B      1-2-2019      Y      null
  B      1-3-2019      Y      null
  B      1-4-2019      R      1
  B      1-5-2019      R      2
  B      1-6-2019      R      3

I've tried to use window function with partition by group and color columns but it returns output below which is not correct.

Wrong Query and Output:

SELECT 
    *, 
    RANK() OVER (PARTITION BY group, color order by group, date) as rank
FROM table

group    date        color    rank
  A      1-1-2019      R      1
  A      1-2-2019      Y      null
  B      1-1-2019      R      1
  B      1-2-2019      Y      null
  B      1-3-2019      Y      null
  B      1-4-2019      R      2
  B      1-5-2019      R      3
  B      1-6-2019      R      4

I'm wondering if it's doable in SQL, or should I switch to another language (like Python)?

Upvotes: 3

Views: 1132

Answers (3)

Erwin Brandstetter
Erwin Brandstetter

Reputation: 656724

Use the window function row_number() for a pure standard SQL solution in Postgres - or any modern RDBMS, even MySQL since version 8:

SELECT grp, the_date, color
     , row_number() OVER (PARTITION BY grp, color, part
                          ORDER BY the_date) AS rnk
FROM  (
   SELECT *
        , row_number() OVER (PARTITION BY grp ORDER BY the_date, color)
        - row_number() OVER (PARTITION BY grp, color ORDER BY the_date) AS part
   FROM   tbl
   ) sub
ORDER BY grp, the_date, color;

This assumes that the combination (grp, color, the_date) is defined UNIQUE, duplicates would create non-deterministic results.

Substracting the two different row numbers computes a distinct number per island (part). Then you can run row_number() once more, now partitioning by the subgroup additionally. Voilá.

To only see numbers for a particular color, 'R' in the example:

SELECT grp, the_date, color, CASE WHEN color = 'R' THEN rnk END AS rnk
FROM  (
   <<query from above, without ORDER BY>>
   ) sub
ORDER  BY grp, the_date, color;

While set-based solution are the forté of RDBMS' and typically faster, a procedural solution only needs a single scan for this type of problem, so this plpgsql function should be substantially faster:

CREATE OR REPLACE FUNCTION rank_color(_color text = 'R')  -- default 'R'
  RETURNS TABLE (grp text, the_date date, color text, rnk int) AS
$func$
DECLARE
   _last_grp text;
BEGIN
   FOR grp, the_date, color IN
      SELECT t.grp, t.the_date, t.color FROM tbl t ORDER BY 1,2
   LOOP
      IF color = $1 THEN
         IF _last_grp = grp THEN
            rnk := COALESCE(rnk + 1, 1);
         ELSE
            rnk := 1;
         END IF;
      ELSIF rnk > 0 THEN  -- minimize assignments
         rnk := NULL;
      END IF;

      RETURN NEXT;
      _last_grp := grp;
   END LOOP;
END
$func$ LANGUAGE plpgsql;

Call:

SELECT * FROM rank_color('R');

db<>fiddle here

Looping is not always the wrong solution in a relational database.

Further reading:

Aside: "rank" is a rather misleading name for those row numbers, unless you have duplicates supposed to rank equally ...

Upvotes: 1

Nick
Nick

Reputation: 147166

This is how it can be done using window functions. First we create a CTE which has a flag which indicates that a new sequence has started, then from that we generate one which counts sequence numbers. Finally we count rows within each sequence to get the rank:

WITH cte AS (SELECT `group`, date, color,
                    COALESCE(color = LAG(color) OVER(ORDER BY `group`, date), 0) AS samecolor
             FROM `table`),
sequences AS (SELECT `group`, date, color,
              SUM(samecolor = 0) OVER (ORDER BY `group`, date) AS seq_num
              FROM cte)
SELECT `group`, date, color,
       ROW_NUMBER() OVER (PARTITION BY seq_num) AS `rank`
FROM sequences
ORDER BY `group`, date

Output:

group   date        color   rank
A       1-1-2019    R       1
A       1-2-2019    Y       1
B       1-1-2019    R       1
B       1-2-2019    Y       1
B       1-3-2019    Y       2
B       1-4-2019    R       1
B       1-5-2019    R       2
B       1-6-2019    R       3

Demo on dbfiddle

Note that this query also gives ranking for Y values, if you want those to be NULL replace the definition of rank with this:

CASE WHEN color = 'Y' THEN NULL
     ELSE ROW_NUMBER() OVER (PARTITION BY seq_num) 
     END AS `rank`

Upvotes: 4

danblack
danblack

Reputation: 14666

Using user variables could keep the rank and previous values to produce the results:

 CREATE TABLE tbl (
   `group` VARCHAR(1),
   `date` VARCHAR(8),
   `color` VARCHAR(1)
 );

 INSERT INTO tbl
   (`group`, `date`, `color`)
 VALUES
   ('A', '1-1-2019', 'R'),
   ('A', '1-2-2019', 'Y'),
   ('B', '1-1-2019', 'R'),
   ('B', '1-2-2019', 'Y'),
   ('B', '1-3-2019', 'Y'),
   ('B', '1-4-2019', 'R'),
   ('B', '1-5-2019', 'R'),
   ('B', '1-6-2019', 'R');

 set @seq := 0, @prev := 'B'

 SELECT 
     *, 
     IF(color='R', @seq := IF(@prev = color, @seq + 1, 1), NULL) AS rank,
     @prev := color as prev
 FROM tbl
 ORDER BY `group`, `date`

group | date     | color | rank | prev
:---- | :------- | :---- | ---: | :---
A     | 1-1-2019 | R     |    1 | R   
A     | 1-2-2019 | Y     |      | Y   
B     | 1-1-2019 | R     |    1 | R   
B     | 1-2-2019 | Y     |      | Y   
B     | 1-3-2019 | Y     |      | Y   
B     | 1-4-2019 | R     |    1 | R   
B     | 1-5-2019 | R     |    2 | R   
B     | 1-6-2019 | R     |    3 | R   

db<>fiddle here

Upvotes: 3

Related Questions