AlwaysLearning

Reputation: 8051

Understanding characteristics of a query for which an index makes a dramatic difference

I am trying to come up with an example showing that indexes can have a dramatic (orders-of-magnitude) effect on query execution time. After hours of trial and error I am still at square one: the speed-up is not large even when the execution plan shows that the index is being used.

Since I realized that the table needs to be large for an index to make a difference, I wrote the following script (using Oracle 11g Express):

CREATE TABLE many_students (
  student_id NUMBER(11),
  city       VARCHAR2(20)
);

DECLARE
  nStudents    NUMBER := 1000000;
  nCities      NUMBER := 10000;
  curCity      VARCHAR2(20);
BEGIN
  FOR i IN 1 .. nStudents LOOP
    curCity := ROUND(DBMS_RANDOM.VALUE() * nCities, 0) || ' City';
    INSERT INTO many_students
    VALUES (i, curCity);
  END LOOP;
  COMMIT;
END;
/

I then tried quite a few queries, such as:

select count(*) 
from many_students M 
where M.city = '5467 City'; 

and

select count(*) 
from many_students M1
join many_students M2 using(city);

and a few other ones.

I have seen this post and think that my queries satisfy the requirements stated in the replies there. However, none of the queries I tried showed dramatic improvement after building an index: create index myindex on many_students(city);

Am I missing some characteristic that distinguishes a query for which an index makes a dramatic difference? What is it?

Upvotes: 0

Views: 74

Answers (2)

Jon Heller

Reputation: 36922

The test case is a good start but it needs a few more things to get a noticeable performance difference:

  1. Realistic data sizes. One million rows of two small values is a small table. With a table that small the performance difference between a good and a bad execution plan may not matter much.

The below script will double the table size until it reaches 64 million rows. It takes about 20 minutes on my machine. (To make it go quicker for larger sizes, you could make the table nologging and add an /*+ append */ hint to the inserts.)

    --Increase the table to 64 million rows.  This took 20 minutes on my machine.
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    commit;
    
    --The table has about 1.375GB of data.  The actual size will vary.
    select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MANY_STUDENTS';
    
  2. Gather statistics. Always gather statistics after large table changes. The optimizer cannot do its job well unless it has table, column, and index statistics.

    begin
        dbms_stats.gather_table_stats(user, 'MANY_STUDENTS');
    end;
    /
    
  3. Use hints to force a good and bad plan. Optimizer hints should usually be avoided, but they are helpful for quickly comparing different plans in a test like this.

    For example, this will force a full table scan:

    select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    

    But you'll also want to verify the execution plan:

    explain plan for select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    select * from table(dbms_xplan.display);
    
  4. Flush the cache. Caching is probably the main culprit behind the index and full table scan queries taking the same amount of time. If the table fits entirely in memory then the time to read all the rows may be almost too small to measure. The number could be dwarfed by the time to parse the query or to send a simple result across the network.

    This command will force Oracle to remove almost everything from the buffer cache. This will help you test a "cold" system. (You probably do not want to run this statement on a production system.)

    alter system flush buffer_cache;
    

    However, that won't flush the operating system or SAN cache. And maybe the table really would fit in memory on production. If you need to test a fast query it may be necessary to put it in a PL/SQL loop.

  5. Multiple, alternating runs. There are many things happening in the background, like caching and other processes. It's easy to get bad results because something unrelated changed on the system.

    Maybe the first run takes extra long to put things in a cache. Or maybe some huge job was started between queries. To avoid those issues, alternate running the two queries. Run them five times, throw out the highs and lows, and compare the averages.

    For example, copy and paste the statements below five times and run them. (If using SQL*Plus, run set timing on first.) I already did that and posted the times I got in a comment before each line.

    --Seconds: 0.02, 0.02, 0.03, 0.234, 0.02
    alter system flush buffer_cache;
    select count(*) from many_students M where M.city = '5467 City';
    
    --Seconds: 4.07, 4.21, 4.35, 3.629, 3.54
    alter system flush buffer_cache;
    select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    
  6. Testing is hard. Putting together decent performance tests is difficult. The above rules are only a start.

    This might seem like overkill at first. But it's a complex topic. And I've seen so many people, including myself, waste a lot of time "tuning" something based on a bad test. Better to spend the extra time now and get the right answer.
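The alternating-runs idea in step 5 (run each variant several times, throw out the highs and lows, compare the trimmed averages) is not Oracle-specific. Here is a minimal sketch of that harness using Python's sqlite3 module as a stand-in for Oracle; the table, column, and index names mirror the question, but the timings are illustrative only and the trimming policy (drop one high, one low) is my assumption:

```python
import sqlite3
import statistics
import time

def trimmed_mean(times):
    """Drop the single highest and lowest measurement, average the rest."""
    trimmed = sorted(times)[1:-1]
    return statistics.mean(trimmed)

def time_query(conn, sql, runs=5):
    """Run the query several times and return the trimmed average in seconds."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        elapsed.append(time.perf_counter() - start)
    return trimmed_mean(elapsed)

# SQLite stands in for Oracle here; 200k rows is enough to see the effect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE many_students (student_id INTEGER, city TEXT)")
conn.executemany("INSERT INTO many_students VALUES (?, ?)",
                 [(i, f"{i % 10000} City") for i in range(200000)])

sql = "SELECT count(*) FROM many_students WHERE city = '5467 City'"
before = time_query(conn, sql)               # full table scan
conn.execute("CREATE INDEX myindex ON many_students (city)")
after = time_query(conn, sql)                # index search
print(f"full scan ~{before:.4f}s, indexed ~{after:.4f}s")
```

In a real test you would also alternate the two queries rather than running all five of one and then five of the other, to spread background noise across both variants.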

Upvotes: 3

eaolson

Reputation: 15090

An index really shines when the database doesn't need to go to every row in a table to get your results. So COUNT(*) isn't the best example. Take this for example:

alter session set statistics_level = 'ALL';
create table mytable as select * from all_objects;
select * from mytable where owner = 'SYS' and object_name = 'DUAL';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

---------------------------------------------------------------------------------------
| Id  | Operation         | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |      1 |        |    300 |00:00:00.01 |      12 |
|   1 |  TABLE ACCESS FULL| MYTABLE |      1 |  19721 |    300 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------

So, here, the database does a full table scan (TABLE ACCESS FULL), which means it has to visit every row in the table, which means it has to load every block from disk. Lots of I/O. The optimizer guessed that it was going to find 19,721 rows (E-Rows), far more than the 300 it actually returned (A-Rows).

Compare that with this:

create index myindex on mytable( owner, object_name );
select * from mytable where owner = 'SYS' and object_name = 'JOB$';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |      1 |00:00:00.01 |       3 |      2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| MYTABLE |      1 |      2 |      1 |00:00:00.01 |       3 |      2 |
|*  2 |   INDEX RANGE SCAN          | MYINDEX |      1 |      1 |      1 |00:00:00.01 |       2 |      2 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OWNER"='SYS' AND "OBJECT_NAME"='JOB$')

Here, because there's an index, the database does an INDEX RANGE SCAN to find the rowids that match our criteria. Then it goes to the table itself (TABLE ACCESS BY INDEX ROWID) and looks up only the rows we need, which it can do efficiently because it has their rowids.

And even better, if you happen to be looking for something that is entirely in the index, the scan doesn't even have to go back to the base table. The index is enough:

select count(*) from mytable where owner = 'SYS';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

------------------------------------------------------------------------------------------------
| Id  | Operation         | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |      1 |        |      1 |00:00:00.01 |      46 |     46 |
|   1 |  SORT AGGREGATE   |         |      1 |      1 |      1 |00:00:00.01 |      46 |     46 |
|*  2 |   INDEX RANGE SCAN| MYINDEX |      1 |   8666 |   9294 |00:00:00.01 |      46 |     46 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OWNER"='SYS')

Because my query involved the owner column, and that's contained in the index, it never needs to go back to the base table to look anything up. So the index scan is enough, and then it does an aggregation to count the rows. This scenario is a little less than perfect, because the index is on (owner, object_name) and not just owner, but it's definitely better than doing a full table scan on the main table.
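The same full-scan-versus-covering-index switch can be reproduced in a self-contained way with Python's sqlite3 module (standing in for Oracle; the table and index names mirror the answer above, and the data is synthetic). SQLite's EXPLAIN QUERY PLAN plays the role of dbms_xplan here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (owner TEXT, object_name TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [("SYS" if i % 3 == 0 else "APP", f"OBJ_{i}")
                 for i in range(10000)])

# Without an index, the planner can only scan every row.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM mytable WHERE owner = 'SYS'"
).fetchall()
print(plan_before[0][-1])   # e.g. "SCAN mytable"

# With a composite index whose leading column is the predicate column,
# the query can be answered from the index alone (a covering index).
cur.execute("CREATE INDEX myindex ON mytable (owner, object_name)")
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM mytable WHERE owner = 'SYS'"
).fetchall()
print(plan_after[0][-1])    # e.g. "SEARCH mytable USING COVERING INDEX myindex (owner=?)"
```

The wording of the plan lines differs between SQLite versions, but the shift from a scan to an index search is the same shape as the Oracle plans shown above.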

Upvotes: 3
