How can I normalize the capitalization of a group-by column?

On SQL Server configured as case-insensitive, group by can have interesting results when the [n][var]char column is not the first group by column. Essentially, it looks like whatever row it encounters "first" (where "first" is undefined in the absence of an order): wins for that grouping. For example:

select x.[day], x.[name], count(1) as [count]
from (
    select 1 as [day], 'a' as [name]
    union all select 1, 'A'
    union all select 2, 'A'
    union all select 2, 'a'
    ) x group by x.[day], x.[name]

which returns, for me:

day         name count
----------- ---- -----------
1           A    2
2           a    2

Using min(x.[name]) has no effect, since the grouping already happened.

I can't add an order by before the group by, as that is illegal; and adding the order by after the group by just defines the output order after the grouping - it still gives a and A.

So: is there a sane way of doing this where the capitalization will at least be consistent for all the groupings? (I'll leave for another day the problem of being consistent for separate runs)

Desired output, either:

day         name count
----------- ---- -----------
1           A    2
2           A    2

or:

day         name count
----------- ---- -----------
1           a    2
2           a    2

Edit: without destroying capitalisation when it is consistent between groups. So no upper/lower. So if one of the groups consistently has the value BcDeF, I want the result of that row to be BcDeF, not bcdef or BCDEF.

Upvotes: 5

Answers (4)

jarlh

Reputation: 44776

Use a case insensitive collation in the Group by, e.g:

select day, name, count(*)
from tablename
group by day, name collate SQL_Latin1_General_Cp1_CI_AS_KI_WI

Perhaps SQL Server has problems here? Using another dbms it executes as:

SQL>create table t (d int, name varchar(10));
SQL>insert into t values (1,'A');
SQL>insert into t values (2,'A');
SQL>insert into t values (2,'a');
SQL>insert into t values (3,'BcDeF');
SQL>insert into t values (3,'bCdEf');
SQL>insert into t values (4,'a');
SQL>select d, name, count(*)
SQL&from t
SQL&group by d, name collate english_1;
          d name
=========== ========== ====================
          1 A                             1
          2 A                             2
          3 BcDeF                         2
          4 a                             1

                  4 rows found

Where english_1 is a case insensitive collation.

As expected?

Upvotes: 1

Lamak

Reputation: 70648

I would use windowing functions for this. By using ROW_NUMBER and partitioning using a case insensitive collation, but ordering by a case sensitive one, we will choose consistently one result with the original capitalisation, but it will group them as if they are the same:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT *
FROM CTE
WHERE RN = 1;

It returns:

╔═════╦═══════╦════╦═══╗
║ day ║ name  ║ RN ║ N ║
╠═════╬═══════╬════╬═══╣
║   1 ║ A     ║  1 ║ 2 ║
║   2 ║ A     ║  1 ║ 2 ║
║   3 ║ BcDeF ║  1 ║ 2 ║
╚═════╩═══════╩════╩═══╝

Following @AndriyM's comment, if you want the same capitalisation over the whole result set, and not just the same day, you can use:

WITH CTE AS
(
    SELECT  *,
            RN = ROW_NUMBER() OVER(PARTITION BY [day], [name]
                                   ORDER BY [name] COLLATE SQL_Latin1_General_Cp1_Cs_AS),
            N = COUNT(*) OVER(PARTITION BY [day], [name])
    FROM (  select 1 as [day], 'a' as [name]
            union all select 1, 'A'
            union all select 2, 'A'
            union all select 2, 'a'
            union all select 3, 'BcDeF'
            union all select 3, 'bCdEf') X
)
SELECT  [day],
        MAX([name] COLLATE SQL_Latin1_General_Cp1_CS_AS) OVER (PARTITION BY [name]) [name],
        N
FROM CTE
WHERE RN = 1;

Upvotes: 8

przemo_li

Reputation: 4053

You can use UPPER in GROUP BY clause to transfer all the values to same capitalization.

Upvotes: 0

Gordon Linoff

Reputation: 1269953

Use upper() or lower():

select x.[day], lower(x.[name]) as name, count(1) as [count]
from (
    select 1 as [day], 'a' as [name]
    union all select 1, 'A'
    union all select 2, 'A'
    union all select 2, 'a'
    ) x
group by x.[day], x.[name];

You are correct that SQL Server chooses a value from an indeterminate row. min() and max() don't help, because the values are equivalent. The simplest solution is to explicitly choose the case that you want.

Upvotes: 1

How can I normalize the capitalization of a group-by column?

Answers (4)

Related Questions