Reputation: 2678
So I was assigned to estimate the performance of two queries and came to a surprising result. I was told beforehand that HAVING is slower than WHERE because it only filters results after accessing the rows. This seemed reasonable, and this question on SQL clause execution order strengthened that impression. However, when I estimated the performance of the following queries under some given assumptions, the HAVING version actually came out faster!
-- Query 1: filtering with HAVING
SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant'

-- Query 2: filtering with WHERE
SELECT status, count(status)
FROM customer
WHERE status != 'Active' AND status != 'Dormant'
GROUP BY status
The assumptions were:

- CUSTOMER has 100 000 records
- accessing a row costs 0.01 ms; each comparison, grouping, or counting operation costs 0.005 ms
- there are 3 distinct status values, and 15 000 rows are neither 'Active' nor 'Dormant'

Based on this, my estimates were:
First query:
Accessing all rows, FROM: 100 000 * 0.01ms = 1000ms
GROUP BY: 100 000 * 0.005ms = 500ms
HAVING (2 conditions, 3 groups): 2 * 3 * 0.005ms = 0.03ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 1.65003s
Second query:
Accessing all the rows, FROM: 1000ms
WHERE: 2 * 100 000 * 0.005ms = 1000ms
GROUP BY: 15 000 * 0.005ms = 75ms
SELECT and COUNT results: 15 000 * 0.01ms = 150ms
Total execution time: 2.225s
The result comes from the fact that GROUP BY produces only three groups, which are very easy to filter, while WHERE has to go through and filter the records one by one.
As I'm (perhaps naïvely) inclined to trust what I was told, I assume I either made a mistake somewhere or the given assumptions are wrong. So does GROUP BY really behave like this with HAVING, resulting in a reduced execution time?
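For reference, the execution plans below can be reproduced with EXPLAIN PLAN and DBMS_XPLAN (a sketch; note the plans show the table under the name CUSM in my schema):

```sql
-- Generate and display the plan for the HAVING variant;
-- swap in the WHERE variant for the second plan.
EXPLAIN PLAN FOR
SELECT status, count(status)
FROM customer
GROUP BY status
HAVING status != 'Active' AND status != 'Dormant';

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```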
PLAN_TABLE_OUTPUT /* With HAVING */
------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 5 | 35 | 4 (25)| 00:00:01 |
|* 1 | FILTER | | | | | |
| 2 | HASH GROUP BY | | 5 | 35 | 4 (25)| 00:00:01 |
| 3 | TABLE ACCESS STORAGE FULL| CUSM | 5 | 35 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
PLAN_TABLE_OUTPUT /* With WHERE */
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 7 | 4 (25)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 7 | 4 (25)| 00:00:01 |
|* 2 | TABLE ACCESS STORAGE FULL| CUSM | 1 | 7 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - storage("STATUS"<>'Active' AND "STATUS"<>'Dormant')
filter("STATUS"<>'Active' AND "STATUS"<>'Dormant')
Upvotes: 6
Views: 4264
Reputation: 36902
One of your assumptions is wrong: HAVING is slower than WHERE because it only filters results after accessing and hashing rows.
It's that hashing part that makes HAVING conditions more expensive than WHERE conditions: hashing requires writing as well as reading data, which can be costly both physically and algorithmically. Ideally, hashing the data runs in O(n) time. But in practice there will be hash collisions, which slow things down, and not all the data will fit in memory.
Those two problems can be disastrous. In the worst case, with limited memory, the hashing requires multiple passes and the complexity approaches O(n^2). And writing to disk in the temporary tablespace is orders of magnitude slower than writing to memory.
Those are the kind of performance issues you need to worry about with databases. The constant time to run simple conditions and expressions is usually irrelevant compared to the time to read, write, and join the data.
That might be especially true in your environment. The operation TABLE ACCESS STORAGE FULL implies you are using Exadata. Depending on the platform, you might be taking advantage of SQL in silicon: those high-level conditions may translate perfectly to low-level instructions executed directly on the storage devices, which means your estimate of the cost of executing a clause may be several orders of magnitude too high.
Create a sample table with 100,000 rows:
create table customer(id number, status varchar2(100));

-- 15,000 'Deceased', 35,001 'Active', 49,999 'Dormant'
insert into customer
select
    level,
    case
        when level <= 15000 then 'Deceased'
        when level between 15001 and 50001 then 'Active'
        else 'Dormant'
    end
from dual
connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'customer');
end;
/
Running the code in a loop shows that the WHERE version is about twice as fast as the HAVING version.
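One way to collect such timings (a sketch, not necessarily the exact harness used) is to wrap each block with DBMS_UTILITY.GET_TIME, which returns elapsed time in hundredths of a second:

```sql
declare
    t0 pls_integer;
begin
    t0 := dbms_utility.get_time;  -- hundredths of a second
    -- ... run one of the 100-iteration blocks below here ...
    dbms_output.put_line('Elapsed: ' || (dbms_utility.get_time - t0) / 100 || ' s');
end;
/
```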
--Run times (in seconds): 0.765, 0.78, 0.765
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        GROUP BY status
        HAVING status != 'Active' AND status != 'Dormant';
    end loop;
end;
/
--Run times (in seconds): 0.39, 0.39, 0.39
declare
    type string_nt is table of varchar2(100);
    type number_nt is table of number;
    v_status string_nt;
    v_count number_nt;
begin
    for i in 1 .. 100 loop
        SELECT status, count(status)
        bulk collect into v_status, v_count
        FROM customer
        WHERE status != 'Active' AND status != 'Dormant'
        GROUP BY status;
    end loop;
end;
/
Upvotes: 2
Reputation: 48850
Here's the thing:
According to the Oracle execution plan, both queries are performing a full table scan. That is, they are reading ALL THE ROWS of the table. No difference there.
The HAVING query performs the GROUP BY (hashing) resulting in 3 rows. Then, it applies the filter to those 3 rows, and returns the result.
The WHERE query applies the filter to each of the 100,000 rows as it reads them, reducing them to 15,000. Finally it groups these (hashing) into a single row and returns it.
I think in the case you are describing, the WHERE query applies the filter to all 100,000 rows, while the HAVING query defers the filter and only applies it to 3 groups. That is what makes the HAVING query come out faster in your estimate.
Now, don't assume this result will apply to every query you have like this. Oracle is very clever about using table statistics, and the plan can change in the future as real data is added to the table. A plan estimated over 5 rows is by no means representative of a plan for 100,000 rows.
Take this result with a grain of salt. Real world scenarios are way trickier.
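Note that in both of these queries the filter is on a grouping column, so it can legally go in either clause. HAVING is only required when the predicate involves an aggregate, which by definition cannot be evaluated before grouping. A hypothetical example:

```sql
-- This predicate needs the per-group counts, so it cannot move into WHERE
SELECT status, COUNT(*)
FROM customer
GROUP BY status
HAVING COUNT(*) > 10000;
```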
Upvotes: 3