Jens Schauder

Reputation: 81907

Oracle Anti-Join Execution plan question

We have two tables like so:

Event
    id
    type
    ... a bunch of other columns

ProcessedEvent
    event_id
    process

There are indexes defined for

The first represents events in an application.

The second records the fact that a certain event got processed by a certain process. Many processes need to process each event, so there are multiple entries in the second table for each entry in the first.

In order to find all the events that need processing we execute the following query:

select * -- of course we do name the columns in the production code
from Event
where type in ( 'typeA', 'typeB', 'typeC')
and id not in (
    select event_id
    from ProcessedEvent
    where process = :1  
)
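For reference, the intended semantics of this query can be checked against a toy dataset. The sketch below uses an in-memory SQLite database standing in for Oracle; the schema follows the question, but all rows and the process name `p1` are invented:

```python
import sqlite3

# Toy version of the question's schema; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Event (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
    CREATE TABLE ProcessedEvent (event_id INTEGER NOT NULL, process TEXT NOT NULL);
    INSERT INTO Event VALUES (1, 'typeA'), (2, 'typeB'), (3, 'typeX'), (4, 'typeC');
    -- events 1 and 2 were handled by process 'p1'; event 4 only by 'p2'
    INSERT INTO ProcessedEvent VALUES (1, 'p1'), (2, 'p1'), (4, 'p2');
""")

# The query from the question: events of the wanted types not yet seen by 'p1'.
rows = conn.execute("""
    SELECT id FROM Event
    WHERE type IN ('typeA', 'typeB', 'typeC')
      AND id NOT IN (SELECT event_id FROM ProcessedEvent WHERE process = ?)
""", ("p1",)).fetchall()

print(rows)  # [(4,)] -- event 4 was processed by p2, but not yet by p1
```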

Statistics are up to date

Since most events are processed, I think the best execution plan should look something like this

Instead Oracle does the following

With an index hint I get Oracle to do the following:

which is really stupid IMHO.

So my question is: what might be the reason for Oracle to insist on the early table access?


Addition: The performance is bad. We are fixing the performance problem by selecting just the Event.IDs and then fetching the needed rows 'manually'. But of course that is just a workaround.

Upvotes: 1

Views: 1753

Answers (4)

Grzegorz Gierlik

Reputation: 11232

Something like:

WITH
  PROCEEDED AS
  (
    SELECT
      event_id
    FROM
      ProcessedEvent
    WHERE
      PROCESS = :1
  )
SELECT
  * -- of course we do name the columns in the production code
FROM
  EVENT
LEFT JOIN PROCEEDED p
ON
  p.event_id = EVENT.id
WHERE
  type           IN ( 'typeA', 'typeB', 'typeC')
  AND p.event_id IS NULL; -- exclude already processed events

could work fast enough (at least much faster than NOT IN).
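The `LEFT JOIN ... IS NULL` form can be checked against the `NOT IN` original on a toy dataset. The sketch below uses in-memory SQLite in place of Oracle, with invented rows; note the two forms only agree while `event_id` is NOT NULL — a NULL in the subquery makes `NOT IN` return no rows at all:

```python
import sqlite3

# Toy version of the question's schema; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Event (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
    CREATE TABLE ProcessedEvent (event_id INTEGER NOT NULL, process TEXT NOT NULL);
    INSERT INTO Event VALUES (1, 'typeA'), (2, 'typeB'), (3, 'typeX'), (4, 'typeC');
    INSERT INTO ProcessedEvent VALUES (1, 'p1'), (2, 'p1'), (4, 'p2');
""")

# Original NOT IN anti-join from the question.
not_in = conn.execute("""
    SELECT id FROM Event
    WHERE type IN ('typeA', 'typeB', 'typeC')
      AND id NOT IN (SELECT event_id FROM ProcessedEvent WHERE process = ?)
""", ("p1",)).fetchall()

# LEFT JOIN ... IS NULL anti-join, joined on Event.id as in this answer.
left_join = conn.execute("""
    SELECT e.id FROM Event e
    LEFT JOIN (SELECT event_id FROM ProcessedEvent WHERE process = ?) p
      ON p.event_id = e.id
    WHERE e.type IN ('typeA', 'typeB', 'typeC')
      AND p.event_id IS NULL
""", ("p1",)).fetchall()

print(not_in == left_join)  # True -- both return only event 4
```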

Upvotes: 0

Vincent Malgrat

Reputation: 67722

Your FULL INDEX SCAN will probably be faster than a FULL TABLE SCAN, since the index is likely "thinner" than the table. Still, the FULL INDEX SCAN reads a complete segment, so its cost is of the same order as the FULL TABLE SCAN's.

However, you're also adding a TABLE ACCESS BY ROWID step. That step is expensive: one logical IO per row for the ROWID access, whereas the FULL TABLE SCAN does one logical IO per multiblock read (sized by your db_file_multiblock_read_count parameter).

In conclusion, the optimizer computes that:

cost(FULL TABLE SCAN) < cost(FULL INDEX SCAN) + cost(TABLE ACCESS BY ROWID)

Update: The FULL TABLE SCAN also enables the filter on type sooner than in the FULL INDEX SCAN path (since the INDEX doesn't know what type an event is), therefore reducing the size of the set that will be anti-joined (yet another advantage of the FULL TABLE SCAN).
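The inequality can be made concrete with a back-of-the-envelope sketch. All figures below (block counts, row count, multiblock read size) are invented for illustration, not taken from the question:

```python
# Back-of-the-envelope logical-IO comparison with invented figures.
table_blocks = 10_000    # blocks in the Event table
index_blocks = 2_000     # blocks in the "thinner" index segment
matching_rows = 50_000   # rows the ROWID step would fetch one by one
mbrc = 16                # db_file_multiblock_read_count

# FULL TABLE SCAN: roughly one IO per multiblock read.
full_table_scan_ios = table_blocks / mbrc

# FULL INDEX SCAN + TABLE ACCESS BY ROWID: the index blocks,
# plus one logical IO per matching row.
index_path_ios = index_blocks + matching_rows

print(full_table_scan_ios)  # 625.0
print(index_path_ios)       # 52000
```

With numbers in this ballpark the ROWID step alone dwarfs the full scan, which is exactly the comparison the optimizer makes.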

Upvotes: 2

John Stauffer

Reputation: 16360

I can't explain the behavior of the optimizer, but my experience has been to avoid "NOT IN" at all costs, replacing it with MINUS, like so:

select * from Event
where id in (
  select id from Event where type in ( 'typeA', 'typeB', 'typeC')
 minus
  select event_id from ProcessedEvent where process = :1
)

I've seen orders-of-magnitude improvements in query performance from similar transformations.
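Oracle's MINUS is spelled EXCEPT in the SQL standard and most other engines. A sketch of the rewrite against the question's schema, using in-memory SQLite with made-up rows and with the `process = :1` filter from the question kept in place:

```python
import sqlite3

# Toy version of the question's schema; the data is made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Event (id INTEGER PRIMARY KEY, type TEXT NOT NULL);
    CREATE TABLE ProcessedEvent (event_id INTEGER NOT NULL, process TEXT NOT NULL);
    INSERT INTO Event VALUES (1, 'typeA'), (2, 'typeB'), (3, 'typeX'), (4, 'typeC');
    INSERT INTO ProcessedEvent VALUES (1, 'p1'), (2, 'p1'), (4, 'p2');
""")

# MINUS-style anti-join; SQLite calls the set operator EXCEPT.
minus_style = conn.execute("""
    SELECT id FROM Event
    WHERE id IN (
        SELECT id FROM Event WHERE type IN ('typeA', 'typeB', 'typeC')
        EXCEPT
        SELECT event_id FROM ProcessedEvent WHERE process = ?
    )
""", ("p1",)).fetchall()

print(minus_style)  # [(4,)] -- only event 4 is unprocessed by p1
```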

Upvotes: 0

anon

Reputation:

The optimizer does many things that do not make sense at first, but it has its reasons. They may not always be right, but they're understandable.

The Event table may be cheaper to full-scan than to access by rowid because of its size: reading the entire table sequentially can involve significantly fewer IO operations than reading it in bits and pieces.

Is the performance bad, or are you just asking why the optimizer did that?

Upvotes: 0
