Reputation: 2678
So I was assigned to estimate the performance of two queries and came to a surprising result. I was told beforehand that HAVING is slower than WHERE because it only filters results after accessing the rows. This seemed reasonable, and this question on SQL clause execution order strengthened that impression. However, when I estimated the performance of the following queries under some given assumptions, the HAVING version actually came out faster!
-- Query 1: filtering with HAVING
SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant'

-- Query 2: filtering with WHERE
SELECT status, count(status)
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status
The assumptions were:

- CUSTOMER has 100 000 records
- accessing a row costs 0.01 ms; each comparison, grouping, or counting operation costs 0.005 ms
- there are 3 distinct status values, and 15 000 rows are neither 'Active' nor 'Dormant'

Based on this, my estimates were:
First query:
Accessing all rows, FROM: 100 000 * 0.01ms = 1000ms
GROUP BY: 100 000 * 0.005ms = 500ms
HAVING (2 conditions, 3 groups): 2 * 3 * 0.005ms = 0.03ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 1.65003s
Second query:
Accessing all the rows, FROM: 1000ms
WHERE: 2 * 100 000 * 0.005ms = 1000ms
GROUP BY: 15 000 * 0.005ms = 75ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 2.225s
The result comes from the fact that GROUP BY produces only three groups, which are very easy to filter, while WHERE has to go through and filter the records one by one.
As I'm (perhaps naïvely) inclined to trust what I was told, I assume I either made a mistake somewhere or the given assumptions are wrong. So does GROUP BY really behave like this with HAVING, resulting in a reduced execution time?
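For reference, the execution plans below can be reproduced with EXPLAIN PLAN and DBMS_XPLAN (a sketch; note the plans show the table under the name CUSM in my schema):

```sql
-- Generate and display the plan for the HAVING variant;
-- swap in the WHERE variant for the second plan.
EXPLAIN PLAN FOR
SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```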
PLAN_TABLE_OUTPUT /* With HAVING */
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 35 | 4 (25)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 5 | 35 | 4 (25)| 00:00:01 |
| 3 | TABLE ACCESS STORAGE FULL| CUSM | 5 | 35 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
PLAN_TABLE_OUTPUT /* With WHERE */
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 4 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 7 | 4 (25)| 00:00:01 |
|* 2 | TABLE ACCESS STORAGE FULL| CUSM | 1 | 7 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("STATUS"<>'Active' AND "STATUS"<>'Dormant')
filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
Upvotes: 6
Views: 4264
Reputation: 36902
One of your assumptions is wrong: HAVING is slower than WHERE because it only filters results after accessing and hashing rows.
It's that hashing part that makes HAVING conditions more expensive than WHERE conditions: hashing requires writing as well as reading data, which can be costly both physically and algorithmically. Ideally, hashing the data runs in O(n) time. But in practice there will be hash collisions, which slow things down, and not all the data will fit in memory.
Those two problems can be disastrous. In the worst case, with limited memory, the hashing requires multiple passes and the complexity approaches O(n^2). And writing to disk in the temporary tablespace is orders of magnitude slower than writing to memory.
Those are the kind of performance issues you need to worry about with databases. The constant time to run simple conditions and expressions is usually irrelevant compared to the time to read, write, and join the data.
That might be especially true in your environment. The operation TABLE ACCESS STORAGE FULL implies you are using Exadata. Depending on the platform, you might be taking advantage of SQL in silicon: those high-level conditions may translate perfectly to low-level instructions executed directly on the storage devices, which means your estimate of the cost of executing a clause may be several orders of magnitude too high.
Create a sample table with 100,000 rows:
create table customer(id number, status varchar2(100));

-- 15,000 'Deceased', 35,001 'Active', 49,999 'Dormant'
insert into customer
select
    level,
    case
        when level <= 15000 then 'Deceased'
        when level between 15001 and 50001 then 'Active'
        else 'Dormant'
    end
from dual
connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'customer');
end;
/
Running the code in a loop shows that the WHERE version is about twice as fast as the HAVING version.
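One way to collect such timings (a sketch, not necessarily the exact harness used) is to wrap each block with DBMS_UTILITY.GET_TIME, which returns elapsed time in hundredths of a second:

```sql
declare
    t0 pls_integer;
begin
    t0 := dbms_utility.get_time;  -- hundredths of a second
    -- ... run one of the 100-iteration blocks below here ...
    dbms_output.put_line('Elapsed: ' || (dbms_utility.get_time - t0) / 100 || ' s');
end;
/
```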
--Run times (in seconds): 0.765, 0.78, 0.765
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        GROUP BY status
        HAVING status != 'Active' AND status != 'Dormant';
    end loop;
end;
/
--Run times (in seconds): 0.39, 0.39, 0.39
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        WHERE status != 'Active' AND status != 'Dormant'
        GROUP BY status;
    end loop;
end;
/
Upvotes: 2
Reputation: 48850
Here's the thing:
According to the Oracle execution plan, both queries are performing a full table scan. That is, they are reading ALL THE ROWS of the table. No difference there.
The HAVING query performs the GROUP BY (hashing) resulting in 3 rows. Then, it applies the filter to those 3 rows, and returns the result.
The WHERE query applies the filter to each of the 100,000 rows as it reads them, reducing them to 15,000. Finally it groups these (hashing) into a single row and returns it.
I think in the case you are describing, the WHERE query applies the filter to all 100,000 rows, while the HAVING query defers the filter and only applies it to 3 groups. That is what makes the HAVING query come out faster in your estimate.
Now, don't assume this result will apply to every query you have like this. Oracle is very clever about using table statistics, and the plan can change in the future as real data is added to the table. A plan estimated over 5 rows is by no means representative of a plan for 100,000 rows.
Take this result with a grain of salt. Real world scenarios are way trickier.
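Note that in both of these queries the filter is on a grouping column, so it can legally go in either clause. HAVING is only required when the predicate involves an aggregate, which by definition cannot be evaluated before grouping. A hypothetical example:

```sql
-- This predicate needs the per-group counts, so it cannot move into WHERE
SELECT status, COUNT(*)
FROM customer
GROUP BY status
HAVING COUNT(*) > 10000;
```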
Upvotes: 3