Reputation: 31
On snowflake, I'm getting different results depending on how I count the distinct values from the same table. I used to think them as equivalent. Given the discrepancies, first I'd like to know in which scenarios these strategies can not be interchanged, and second, how to tell what strategy is returning the right number.
I include the query I'm using to test this
select 'count_distinct_subquery' as strat,count(*) from (
select distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
from src
)
union all
select 'count_distinct' as strat,count(
distinct
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt,
sku_no,
pog_segment_name,
shelf_no,
position_id
)
Upvotes: 1
Views: 91
Reputation: 25968
As per Lukasz answer:
with data(col1, col2) as (
select * from values
(1, 10),
(1, 10),
(1, null),
(null, null),
(null, null)
), unions as (
select
'count_distinct_subquery' as strat,
count(*) as count
from (
select distinct col1, col2
from data
)
union all
select
'count_distinct' as strat,
count(distinct col1, col2)
from data
union all
select
'group_by_subquery' as strat,
count(*)
from (
select *
from data
group by 1,2
)
)
select * from unions;
gives:
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
the first and last are the same thing. The sub-select DISTINCT and GROUP BY are find the permutations, AND they respect NULLs as things, and then you count the number of rows present, which is 3.
The middle is asking for the count of row with no-nulls, which is 1.
expr1 This should be either:
A column name, which can be a qualified name (e.g. > database.schema.table.column_name).
Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.
This can be further seen by taking the sub-selects from the first and third queries, and wrap those in a count(distinct col1, col2)
instead of the count(*)
select
'group_by_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select *
from data
group by 1,2
)
union all
select
'count_distinct_subquery_count_cols' as strat,
count(distinct col1, col2)
from (
select distinct col1, col2
from data
)
and now we get 1's again not three's, from the same data.
STRAT | COUNT |
---|---|
count_distinct_subquery | 3 |
count_distinct | 1 |
group_by_subquery | 3 |
group_by_subquery_count_cols | 1 |
count_distinct_subquery_count_cols | 1 |
Upvotes: 0
Reputation: 175726
The second version count (distinct expr1, ...)
skips NULLs.
CREATE OR REPLACE TABLE src
AS
SELECT NULL AS plan_code, 1 AS fis_we_dt;
select 'count_distinct_subquery' as strat,count(*) from (
select distinct
plan_code,
fis_we_dt
from src
)
union all
select 'count_distinct' as strat,count(
distinct
plan_code,
fis_we_dt
)
from src
union all
select 'group_by_subquery' as strat, count(*) from (
select *
from src
group by
plan_code,
fis_we_dt
);
COUNT( [ DISTINCT ] <expr1> [ , <expr2> ... ] )
Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs. See Examples for an example.
Aggregate Functions and NULL Values
Some aggregate functions can be passed more than one column. For example:
SELECT COUNT(col1, col2) FROM table1;
In these instances, the aggregate function ignores a row if any individual column is NULL.
Upvotes: 1