Reputation: 33
I have a table which looks like:
group date color
A 1-1-2019 R
A 1-2-2019 Y
B 1-1-2019 R
B 1-2-2019 Y
B 1-3-2019 Y
B 1-4-2019 R
B 1-5-2019 R
B 1-6-2019 R
And it's ordered by group and date. I want an extra column showing sequential number of consecutive color 'R' for each group.
Required output:
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y null
B 1-1-2019 R 1
B 1-2-2019 Y null
B 1-3-2019 Y null
B 1-4-2019 R 1
B 1-5-2019 R 2
B 1-6-2019 R 3
I've tried to use window function with partition by group and color columns but it returns output below which is not correct.
Wrong Query and Output:
SELECT
*,
RANK() OVER (PARTITION BY group, color order by group, date) as rank
FROM table
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y null
B 1-1-2019 R 1
B 1-2-2019 Y null
B 1-3-2019 Y null
B 1-4-2019 R 2
B 1-5-2019 R 3
B 1-6-2019 R 4
I'm wondering if it's doable in SQL, or should I switch to another language (like Python)?
Upvotes: 3
Views: 1132
Reputation: 656724
Use the window function row_number()
for a pure standard SQL solution in Postgres - or any modern RDBMS, even MySQL since version 8:
SELECT grp, the_date, color
, row_number() OVER (PARTITION BY grp, color, part
ORDER BY the_date) AS rnk
FROM (
SELECT *
, row_number() OVER (PARTITION BY grp ORDER BY the_date, color)
- row_number() OVER (PARTITION BY grp, color ORDER BY the_date) AS part
FROM tbl
) sub
ORDER BY grp, the_date, color;
This assumes that the combination (grp, color, the_date)
is defined UNIQUE
, duplicates would create non-deterministic results.
Substracting the two different row numbers computes a distinct number per island (part
). Then you can run row_number()
once more, now partitioning by the subgroup additionally. Voilá.
To only see numbers for a particular color, 'R' in the example:
SELECT grp, the_date, color, CASE WHEN color = 'R' THEN rnk END AS rnk
FROM (
<<query from above, without ORDER BY>>
) sub
ORDER BY grp, the_date, color;
While set-based solution are the forté of RDBMS' and typically faster, a procedural solution only needs a single scan for this type of problem, so this plpgsql function should be substantially faster:
CREATE OR REPLACE FUNCTION rank_color(_color text = 'R') -- default 'R'
RETURNS TABLE (grp text, the_date date, color text, rnk int) AS
$func$
DECLARE
_last_grp text;
BEGIN
FOR grp, the_date, color IN
SELECT t.grp, t.the_date, t.color FROM tbl t ORDER BY 1,2
LOOP
IF color = $1 THEN
IF _last_grp = grp THEN
rnk := COALESCE(rnk + 1, 1);
ELSE
rnk := 1;
END IF;
ELSIF rnk > 0 THEN -- minimize assignments
rnk := NULL;
END IF;
RETURN NEXT;
_last_grp := grp;
END LOOP;
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM rank_color('R');
db<>fiddle here
Looping is not always the wrong solution in a relational database.
Further reading:
Aside: "rank" is a rather misleading name for those row numbers, unless you have duplicates supposed to rank equally ...
Upvotes: 1
Reputation: 147166
This is how it can be done using window functions. First we create a CTE which has a flag which indicates that a new sequence has started, then from that we generate one which counts sequence numbers. Finally we count rows within each sequence to get the rank:
WITH cte AS (SELECT `group`, date, color,
COALESCE(color = LAG(color) OVER(ORDER BY `group`, date), 0) AS samecolor
FROM `table`),
sequences AS (SELECT `group`, date, color,
SUM(samecolor = 0) OVER (ORDER BY `group`, date) AS seq_num
FROM cte)
SELECT `group`, date, color,
ROW_NUMBER() OVER (PARTITION BY seq_num) AS `rank`
FROM sequences
ORDER BY `group`, date
Output:
group date color rank
A 1-1-2019 R 1
A 1-2-2019 Y 1
B 1-1-2019 R 1
B 1-2-2019 Y 1
B 1-3-2019 Y 2
B 1-4-2019 R 1
B 1-5-2019 R 2
B 1-6-2019 R 3
Note that this query also gives ranking for Y
values, if you want those to be NULL
replace the definition of rank
with this:
CASE WHEN color = 'Y' THEN NULL
ELSE ROW_NUMBER() OVER (PARTITION BY seq_num)
END AS `rank`
Upvotes: 4
Reputation: 14666
Using user variables could keep the rank and previous values to produce the results:
CREATE TABLE tbl (
`group` VARCHAR(1),
`date` VARCHAR(8),
`color` VARCHAR(1)
);
INSERT INTO tbl
(`group`, `date`, `color`)
VALUES
('A', '1-1-2019', 'R'),
('A', '1-2-2019', 'Y'),
('B', '1-1-2019', 'R'),
('B', '1-2-2019', 'Y'),
('B', '1-3-2019', 'Y'),
('B', '1-4-2019', 'R'),
('B', '1-5-2019', 'R'),
('B', '1-6-2019', 'R');
set @seq := 0, @prev := 'B'
SELECT
*,
IF(color='R', @seq := IF(@prev = color, @seq + 1, 1), NULL) AS rank,
@prev := color as prev
FROM tbl
ORDER BY `group`, `date`
group | date | color | rank | prev
:---- | :------- | :---- | ---: | :---
A | 1-1-2019 | R | 1 | R
A | 1-2-2019 | Y | | Y
B | 1-1-2019 | R | 1 | R
B | 1-2-2019 | Y | | Y
B | 1-3-2019 | Y | | Y
B | 1-4-2019 | R | 1 | R
B | 1-5-2019 | R | 2 | R
B | 1-6-2019 | R | 3 | R
db<>fiddle here
Upvotes: 3