Reputation: 155

Sql ranking groups based on a values in a field

I have a table in which one column's values will be like this.

C
H
C
H
H
H
H
C
H
H
H

It will have one "C" and one or more "H" records following.

I am trying to give a group number for each set of "C" and one or more "H".

I don't want to use cursors for the fear of performance. How can I give a unique no for each subset of "C" and one or more "H" records?

Upvotes: 2

Answers (2)

Mike Sherrill 'Cat Recall'

Reputation: 95562

As long as your data is clean and consistent, this isn't too hard on platforms that support standard SQL window functions. You do need another column that you can meaningfully order by.

Let's build it up one piece at a time. (Written in PostgreSQL 9.3.)

create table test (
  test_id serial primary key,
  test_val char(1)
);
insert into test(test_val) values
('C'), ('H'), 
('C'),('H'),('H'),('H'),('H'),
('C'),('H'),('H'),('H');

We can tell when a group starts by looking at the next row.

select test_id, test_val,
       lead(test_val) over (order by test_id) next_test_val
from test;

First three rows from that query.

test_id  test_val  next_test_val
--
1        C         H
2        H         C
3        C         H
...

By checking for the "C" then "H" combination, we can identify the start of a group. (The previous query becomes a common table expression.)

with next_vals as (
select test_id, test_val,
       lead(test_val) over (order by test_id) next_test_val
from test
)
select *, case when test_val = 'C' and next_test_val = 'H' then test_id 
      end as grp
from next_vals;

Here are the first four rows from that result set. The id numbers are convenient for identifying a group.

test_id  test_val  next_test_val  grp
--
1        C         H              1
2        H         C
3        C         H              3
4        H         H
...

Another window function fills in the blanks. Again, the previous query becomes a CTE. The WHERE clause guards against a "C" row followed by another "C" row.

with next_vals as (
select test_id, test_val,
       lead(test_val) over (order by test_id) next_test_val
from test
), group_starts as (
select *
    , case when test_val = 'C' and next_test_val = 'H' then test_id 
      end as grp
from next_vals
)
select test_id, test_val, max(grp) over (order by test_id) as ch_group
from group_starts
where not (test_val = 'C' and next_test_val = 'C')
order by test_id;

test_id  test_val  ch_group
--
1        C         1
2        H         1

3        C         3
4        H         3
5        H         3
6        H         3
7        H         3

8        C         8
9        H         8
10       H         8
11       H         8

I added some line breaks to make it easier to read.

I don't know whether this will perform better than a cursor.

For sequential group numbers . . .

with next_vals as (
select test_id, test_val,
       lead(test_val) over (order by test_id) next_test_val
from test
), group_starts as (
select *
    , case when test_val = 'C' and next_test_val = 'H' then test_id 
      end as grp
from next_vals
), grouped_values as (
select test_id, test_val, max(grp) over (order by test_id) as ch_group
from group_starts
where not (test_val = 'C' and next_test_val = 'C')
)
select test_id, test_val, 
       dense_rank() over (order by ch_group)
from grouped_values
order by test_id;

Upvotes: 1

Vladimir Baranov

Reputation: 32695

Here is one possible solution that works in MS SQL Server 2008, which doesn't have LEAD function (it was added in later versions). Also, this solution numbers groups sequentially without gaps, as shown in the desired output.

It uses only ROW_NUMBER() function and CROSS APPLY.

It is necessary to have ID column that uniquely identifies each row and which we can use to sort results by.

Create a test table with sample data:

DECLARE @TT TABLE (ID int IDENTITY(1,1) PRIMARY KEY, Val char(1));

INSERT INTO @TT VALUES('C');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('C');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('C');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('H');
INSERT INTO @TT VALUES('H');

Get a list of all rows with C value. Each group starts with C, so there will be so many groups as we have Cs in the data. It doesn't matter what other values there are in this column, there can be other values as well, not just H. There is no hard-coded H in the query, only C.

WITH
CTE_C
AS
(
    SELECT ID, Val, ROW_NUMBER() OVER(ORDER BY ID) AS rn
    FROM @TT AS T
    WHERE Val = 'C'
)

The output of this CTE is: (SELECT * FROM CTE_C)

ID   Val   rn
1    C     1
3    C     2
8    C     3

Now all we need is for each row of the original data find a suitable row in the CTE, that has a suitable ID and consequently suitable rn. We use CROSS APPLY for it.

WITH
CTE_C
AS
(
    SELECT ID, Val, ROW_NUMBER() OVER(ORDER BY ID) AS rn
    FROM @TT AS T
    WHERE Val = 'C'
)
SELECT T.ID, T.Val, CTE_rn.rn
FROM
    @TT AS T
    CROSS APPLY
    (
        SELECT TOP(1) CTE_C.rn
        FROM CTE_C
        WHERE CTE_C.ID <= T.ID
        ORDER BY CTE_C.ID DESC
    ) AS CTE_rn
ORDER BY T.ID;

This is the final result:

ID   Val   rn
1    C     1
2    H     1
3    C     2
4    H     2
5    H     2
6    H     2
7    H     2
8    C     3
9    H     3
10   H     3
11   H     3

In terms of performance you need to test various solutions with your actual data and your actual system. ID should have unique index. Most likely an index on Val would be beneficial as well.

Upvotes: 0

Sql ranking groups based on a values in a field

Answers (2)

Related Questions