Reynier Booysen
Reynier Booysen

Reputation: 281

SQL Server Group By Sets without duplicates

I need to create a "rolled up" slash "grouped" view of Customer Data for our client.

A simplified explanation would be that data need to be grouped by geographical (ex. Country, Province, City etc.) data and rolled up by the amount of people that have an email address and/or a phone number.

The problem is that a person can be in more than one Cities (lowest level) and then are counted multiple times in any higher levels (ex Province).

Here is an example using GROUPING SETS:

DECLARE @Customer TABLE
(
    CustomerId VARCHAR(50),
    Phone BIT,
    Email BIT,
    ProvinceId VARCHAR(50),
    CityId VARCHAR(50)
)

INSERT INTO @Customer(CustomerId, Phone, Email, ProvinceId, CityId) VALUES ('Customer A', 1, NULL, 'Province A', 'City A')
INSERT INTO @Customer(CustomerId, Phone, Email, ProvinceId, CityId) VALUES ('Customer A', 1, NULL, 'Province A', 'City B')
INSERT INTO @Customer(CustomerId, Phone, Email, ProvinceId, CityId) VALUES ('Customer B', 1, 1, 'Province A', 'City B')

SELECT COUNT(Phone) PersonWithPhoneCount, COUNT(Email) PersonWithEmailCount, ProvinceId, CityId FROM @Customer
GROUP BY GROUPING SETS ((ProvinceId), (ProvinceId, CityId))

and this is the result:

----------------------------------------------------------------------------
| PersonWithPhoneCount  |  PersonWithEmailCount  |  ProvinceId  |  CityId  |
----------------------------------------------------------------------------
|                    1  |                     0  |  Province A  |  City A  |
|                    2  |                     1  |  Province A  |  City B  |
|                    3  |                     1  |  Province A  |   NULL   |
----------------------------------------------------------------------------

The result is correct for the lowest level (City) but for the Province level "Customer A" is counted twice. I understand why, but is there a way to not count "Customer A" twice?

Do I have to group all the different levels individually or is there a better way?

Performance is also a major issue as the live data adds up to 100+ million rows.

Thanks in advance.

Upvotes: 0

Views: 90

Answers (1)

Steve
Steve

Reputation: 5545

Even though your data will be wrong, because there is no way Customer A can be in City A and City B, this sql will get you what you are asking for. I used the ROW_NUMBER() function so I only count the first occurrence of the customer.

SELECT COUNT(Phone) PersonWithPhoneCount, COUNT(Email) PersonWithEmailCount, ProvinceId, CityId
FROM (
SELECT *
    ,ROW_NUMBER() OVER(PARTITION BY CustomerId 
            ORDER BY ProvinceId, CityId) Row
FROM @Customer c1
) Tmp
Where Row = 1
GROUP BY GROUPING SETS ((ProvinceId), (ProvinceId, CityId))

Upvotes: 0

Related Questions