Reputation: 14845
I have a very large table in oracle 11g that has a very simple index in a char field (that is normally Y or N) If I just execute the queue as bellow it takes around 10s to return
select QueueId, QueueSiteId, QueueData from queue where QueueProcessed = 'N'
However if I force it to use the index I create it takes 80ms
select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */ QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
Also if I run under the explain plan for as bellow:
explain plan for select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
and
explain plan for select /*+ INDEX(avaqueue QUEUEPROCESSED_IDX) */
QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N'
For the frist plan I got:
------------------------------------------------------------------------------
Plan hash value: 803924726
------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 691K| 128M| 12643 (1)| 00:02:32 |
|* 1 | TABLE ACCESS FULL| AVAQUEUE | 691K| 128M| 12643 (1)| 00:02:32 |
------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("QUEUEPROCESSED"='N')
For the second pla I got:
Plan hash value: 2012309891
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 691K| 128M| 24386 (1)| 00:04:53 |
| 1 | TABLE ACCESS BY INDEX ROWID| AVAQUEUE | 691K| 128M| 24386 (1)| 00:04:53 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 691K| | 1297 (1)| 00:00:16 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("QUEUEPROCESSED"='N')
------------------------------------------------------------------------------
What proves that if I don't explicit tell oracle to use the index it does not use it, my question is why is oracle not using this index? Oracle is normally smart enough to make decisions 10 times better than me, that is the first time I actually have to force oracle to use a index and I am not very comfortable with it.
Does anyone have a good explanation for oracle decision to not use the index in this very explicit case?
Upvotes: 2
Views: 645
Reputation: 36922
The QueueProcessed column is probably missing a histogram so Oracle does not know the data is skewed.
If Oracle does not know the data is skewed it will assume the equality predicate, QueueProcessed = 'N'
, returns DBA_TABLES.NUM_ROWS /
DBA_TAB_COLUMNS.NUM_DISTINCT. The optimizer thinks the query returns half the rows in the table. Based on the 80ms return time the real number of rows returned is small.
Index range scans generally only work well when they select a small percentage of the rows. Index range scans read from a data structure one block at a time. And if the data is randomly distributed, it may need to read every block of data from the table anyway. For those reasons, if the query accesses a large portion of the table, it is more efficient to use a multi-block full table scan.
The bad cardinality estimate from the skewed data causes Oracle to think a full table scan is better. Creating a histogram will fix the issue.
Sample schema
Create a table, fill it with skewed data, and gather statistics the first time.
drop table queue;
create table queue(
queueid number,
queuesiteid number,
queuedata varchar2(4000),
queueprocessed varchar2(1)
);
create index QUEUEPROCESSED_IDX on queue(queueprocessed);
--Skewed data - only 100 of the 100000 rows are set to N.
insert into queue
select level, level, level, decode(mod(level, 1000), 0, 'N', 'Y')
from dual connect by level <= 100000;
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
The first execution will have the problem.
In this case the default statistics settings do not gather histograms the first time. The plan shows a full table scan and estimates Rows=50000, exactly half.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 1157425618
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 50000 | 878K| 103 (1)| 00:00:01 |
|* 1 | TABLE ACCESS FULL| QUEUE | 50000 | 878K| 103 (1)| 00:00:01 |
---------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter("QUEUEPROCESSED"='N')
Create a histogram
The default statistics settings are usually sufficient. Histogram may not be collected for several reasons. They may be manually disabled - check for the tasks, jobs or preferences set by the DBA.
Also, histograms are only automatically collected on columns that are both skewed and used. Gathering histograms can take time, there's no need to create the histogram on a column that is never used in a relevant predicate. Oracle tracks when a column is used and could benefit from a histogram, although that data is lost if the table is dropped.
Running a sample query and re-gathering statistics will make the histogram appear:
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
begin
dbms_stats.gather_table_stats(user, 'QUEUE');
end;
/
Now the Rows=100 and the Index is used.
explain plan for
select QueueId, QueueSiteId, QueueData
from queue where QueueProcessed = 'N';
select * from table(dbms_xplan.display);
Plan hash value: 2630796144
----------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100 | 1800 | 2 (0)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| QUEUE | 100 | 1800 | 2 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | QUEUEPROCESSED_IDX | 100 | | 1 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - access("QUEUEPROCESSED"='N')
Here's the histogram:
select column_name, histogram
from dba_tab_columns
where table_name = 'QUEUE'
order by column_name;
COLUMN_NAME HISTOGRAM
----------- ---------
QUEUEDATA NONE
QUEUEID NONE
QUEUEPROCESSED FREQUENCY
QUEUESITEID NONE
Create the histogram
Try to determine why the histogram was missing. Check that statistics are gathered with the defaults, there are no weird column or table preferences, and that table is not constantly dropped and re-loaded.
If you cannot rely on the default statistics job for your process you can manually gather histograms with the method_opt
parameter like this:
begin
dbms_stats.gather_table_stats(user, 'QUEUE', method_opt=>'for columns size 254 queueprocessed');
end;
/
Upvotes: 2
Reputation: 48131
The answer - at least the first one that will just lead to more questions - is right there in the plans. The first plan has an estimated cost and estimated execution time about half that of the second plan. In the absence of the hint, Oracle is choosing the plan that it thinks will run faster.
So of course the next question is why is its estimate so far off in this case. Not only are the estimated times wrong relative to each other, both are much greater than what you actually experience when running the query.
The first thing I would look at is the estimated number of rows returned. The optimizer is guessing, in both cases, that there are about 691,000 rows in table matching your predicate. Is this close to the truth, or very far off? If it's far off, then refreshing statistics may be the right solution. Although if the column only has two possible values, I'd be kind of surprised if the existing stats are so off base.
Upvotes: 2