Mike Jone
Mike Jone

Reputation: 23

COUNT(*) vs COUNT(column) Performance in Snowflake

Since SnowFlake is a columnar database, does it impact performance when you use COUNT(*) vs COUNT(column)? And this is assuming that the column that you're referencing does NOT have any NULLs

Upvotes: 0

Views: 8886

Answers (2)

Niru
Niru

Reputation: 489

Here is the snowflake doc.. hope it helps

https://docs.snowflake.com/en/sql-reference/functions/count.html Please refer to snowflake document.. it does effect count(alias.*) will check the each column in the row where as count(column) do null check only on that column..

Upvotes: 0

Gokhan Atil
Gokhan Atil

Reputation: 10079

As a_horse_with_no_name explained these two functions are different but you already mentioned that the column has no NULL values. So they should return the same result in your case.

More important thing is, Snowflake has a special optimization for the COUNT function. As far I see, it does NOT impact performance if you use COUNT(*) or COUNT(column), even when the column contains NULL values! For both of them, Snowflake uses METADATA statistics, so it does not actually count rows.

You can test it with SNOWFLAKE_SAMPLE_DATA:

select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709

select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709

Both queries will return a result immediately although the table size is about 170G, and contain more than 5G rows.

I have to add this extra information because of the conversation between Niru and a_horse_with_no_name. a_horse_with_no_name said:

Even if all columns of a row are NULL, count(*) should include that row in the result. If it doesn't this is a clear violation of the SQL standard

I'm not sure about the SQL standard but when you use COUNT(*), Snowflake doesn't check if the columns are NULL or not (as you expected). I can see why Niru misunderstood the documents, the docs and the samples should be improved.

If you run my sample queries, you will see that they are completed in milliseconds. We are talking about counting almost 6 billion rows:

select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds

select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds

But if I do a little modification on the query, it takes about 3 minutes on the same warehouse (XSMALL):

select count(t.*) from sample_data.TPCH_SF1000.LINEITEM t; 
-- completes in 3 minutes!?

Here is the trick:

Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs.

https://docs.snowflake.com/en/sql-reference/functions/count.html#arguments

Only if you use alias.* (like I used t.* in my sample), Snowflake will check if all columns are null when producing the count. This is why it is much slower, and this is why there shouldn't be any performance issues when you are running COUNT(XYZ) or COUNT(*) on a table.

Upvotes: 3

Related Questions