Tobias
Tobias

Reputation: 2139

How to get DISTINCT rows COUNTed with different conditions in a single SELECT statement?

I have a table with orders from different users. Each user (identified by an e-mail address) can have any number of orders. I don't have an extra table with users, just a plain OrderData table. What I want sounds simple:

  1. Total number of users
  2. Number of XYZ users
  3. Number of Non-XYZ users

where 1. is the sum of 2. and 3. "XYZ" users are defined by their e-mail address ending in "@xyz.com".

I want to have a single query, returning the three values in three columns. What I currently have is:

SELECT 
    (
    SELECT COUNT(DISTINCT User_EmailAddress)
    FROM OrderData
    WHERE User_EmailAddress IS NOT NULL
    AND RequestTime >= @RequestTimeFrom
    AND RequestTime  < @RequestTimeTo
    ) AS [Total Users],
    (
    SELECT COUNT(DISTINCT User_EmailAddress)
    FROM OrderData
    WHERE User_EmailAddress IS NOT NULL
    AND User_EmailAddress LIKE '%@xyz.com'
    AND RequestTime >= @RequestTimeFrom
    AND RequestTime  < @RequestTimeTo
    ) AS [XYZUsers],
    (
    SELECT COUNT(DISTINCT User_EmailAddress)
    FROM OrderData
    WHERE User_EmailAddress IS NOT NULL
    AND User_EmailAddress NOT LIKE '%@xyz.com'
    AND RequestTime >= @RequestTimeFrom
    AND RequestTime  < @RequestTimeTo
    ) AS [Non-XYZ Users]

It returns the correct result set:

Total Users | XYZ Users | Non-XYZ Users
------------+-----------+--------------
        123 |        23 |           100

Is there a nicer way to write this query, instead of basically having three similar queries with the same code?

Upvotes: 0

Views: 1822

Answers (2)

Aaron Bertrand
Aaron Bertrand

Reputation: 280252

Here is a solution that reduces redundant code and also avoid problems with multiple scans of the data. Since all you really care about is the counts, you don't even need to return any of the actual data, and you can collapse almost all of the logic into a single CASE expression:

;WITH x AS 
(
  SELECT isxyz = CASE WHEN User_EmailAddress LIKE '%@xyz.com' THEN 1 ELSE 0 END
  FROM dbo.OrderData
  WHERE User_EmailAddress IS NOT NULL
    AND RequestTime >= @RequestTimeFrom
    AND RequestTime  < @RequestTimeTo
    GROUP BY User_EmailAddress
)
SELECT 
  TotalUsers      = COUNT(isxyz),
  XYZUsers        = SUM(isxyz),
  [Non-XYZ Users] = COUNT(NULLIF(isxyz,1))
FROM x;

The trick is to use the aggregates to your advantage. Since the 1/0 output is not a true BIT column, you can SUM it to get the count of 1s. And the opposite can be achieved by changing the 1s to NULL, as NULLs are ignored by COUNT.

Upvotes: 2

Arvind Shyamsundar
Arvind Shyamsundar

Reputation: 1

The solution provided by bluefeet still has one drawback - each DISTINCT aggregation (in this case the COUNT) consumes a stream of input rows separately. This may cause separate scans / seeks of the base table - one set per distinct aggregation. Though sometimes the optimizer can spool the base table seek / scan into a work table and then compute the aggregates.

A more performance oriented way to do this (in this specific case) is to get the total count into a variable, then the xyz count into another variable, and then using scalar arithmetic deduce the third value (the non-xyz). That way you can avoid one spool replay (or even worse the base table seek / scan operator.)

Read this blog by Paul White for more information: http://web.archive.org/web/20170606142356/http://sqlblog.com/blogs/paul_white/archive/2011/12/04/is-distinct-aggregation-still-considered-harmful.aspx

Upvotes: 0

Related Questions