Reputation: 31

count (distinct ...) VS count from (select distinct ...) VS group by

On snowflake, I'm getting different results depending on how I count the distinct values from the same table. I used to think them as equivalent. Given the discrepancies, first I'd like to know in which scenarios these strategies can not be interchanged, and second, how to tell what strategy is returning the right number.

I include the query I'm using to test this

select 'count_distinct_subquery' as strat,count(*) from (
  select distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
  from src 
)
union all
select 'count_distinct' as strat,count(
  distinct 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt,
  sku_no,
  pog_segment_name,
  shelf_no,
  position_id
)

The output is as in the image

Upvotes: 1

Answers (2)

Simeon Pilgrim

Reputation: 25968

As per Lukasz answer:

with data(col1, col2) as (
    select * from values
    (1, 10),
    (1, 10),
    (1, null),
    (null, null),
    (null, null)
), unions as (
    select 
        'count_distinct_subquery' as strat,
        count(*) as count
    from (
        select distinct col1, col2
        from data 
    )
    
    union all
    
    select 
        'count_distinct' as strat,
        count(distinct col1, col2)
    from data
    
    union all 
    
    select 
        'group_by_subquery' as strat, 
        count(*) 
    from (
        select *
        from data
        group by 1,2
    )
)
select * from unions;

gives:

STRAT	COUNT
count_distinct_subquery	3
count_distinct	1
group_by_subquery	3

the first and last are the same thing. The sub-select DISTINCT and GROUP BY are find the permutations, AND they respect NULLs as things, and then you count the number of rows present, which is 3.

The middle is asking for the count of row with no-nulls, which is 1.

expr1 This should be either:

A column name, which can be a qualified name (e.g. > database.schema.table.column_name).

Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.

This can be further seen by taking the sub-selects from the first and third queries, and wrap those in a count(distinct col1, col2) instead of the count(*)

    select 
        'group_by_subquery_count_cols' as strat, 
        count(distinct col1, col2)
    from (
        select *
        from data
        group by 1,2
    )

    union all

    select 
        'count_distinct_subquery_count_cols' as strat,
        count(distinct col1, col2)
    from (
        select distinct col1, col2
        from data 
    )

and now we get 1's again not three's, from the same data.

STRAT	COUNT
count_distinct_subquery	3
count_distinct	1
group_by_subquery	3
group_by_subquery_count_cols	1
count_distinct_subquery_count_cols	1

Upvotes: 0

Lukasz Szozda

Reputation: 175726

The second version count (distinct expr1, ...) skips NULLs.

CREATE OR REPLACE TABLE src
AS 
SELECT NULL AS plan_code, 1 AS fis_we_dt;

select 'count_distinct_subquery' as strat,count(*) from (
  select distinct 
  plan_code,
  fis_we_dt
  from src 
)
union all
select 'count_distinct' as strat,count(
  distinct 
  plan_code,
  fis_we_dt
)
from src 
union all 
select 'group_by_subquery' as strat, count(*) from (
  select *
  from src 
  group by 
  plan_code,
  fis_we_dt
);

COUNT(*)
COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )
Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.

Aggregate Functions and NULL Values

Some aggregate functions can be passed more than one column. For example:
SELECT COUNT(col1, col2) FROM table1;
In these instances, the aggregate function ignores a row if any individual column is NULL.

Upvotes: 1

count (distinct ...) VS count from (select distinct ...) VS group by

Answers (2)

Related Questions