AlwaysLearning

Reputation: 8051

Understanding characteristics of a query for which an index makes a dramatic difference

I am trying to come up with an example showing that indexes can have a dramatic (orders-of-magnitude) effect on query execution time. After hours of trial and error I am still at square one: the speed-up is not large even when the execution plan shows that the index is being used.

Since I realized that the table needs to be large for an index to make a difference, I wrote the following script (using Oracle 11g Express):

CREATE TABLE many_students (
  student_id NUMBER(11),
  city       VARCHAR2(20)
);

DECLARE
  nStudents    NUMBER := 1000000;
  nCities      NUMBER := 10000;
  curCity      VARCHAR2(20);
BEGIN
  FOR i IN 1 .. nStudents LOOP
    curCity := ROUND(DBMS_RANDOM.VALUE() * nCities, 0) || ' City';
    INSERT INTO many_students
    VALUES (i, curCity);
  END LOOP;
  COMMIT;
END;
/

I then tried quite a few queries, such as:

select count(*) 
from many_students M 
where M.city = '5467 City'; 

and

select count(*) 
from many_students M1
join many_students M2 using(city);

and a few other ones.

I have seen this post and think that my queries satisfy the requirements stated in the replies there. However, none of the queries I tried showed dramatic improvement after building an index: create index myindex on many_students(city);

Am I missing some characteristic that distinguishes a query for which an index makes a dramatic difference? What is it?

Upvotes: 0

Views: 74

Answers (2)

Jon Heller

Reputation: 36922

The test case is a good start but it needs a few more things to get a noticeable performance difference:

  1. Realistic data sizes. One million rows of two small values is a small table. With a table that small the performance difference between a good and a bad execution plan may not matter much.

The below script will double the table size until it reaches 64 million rows. It takes about 20 minutes on my machine. (To make it go quicker for larger sizes, you could make the table nologging and add an /*+ append */ hint to the inserts.)

    --Increase the table to 64 million rows.  This took 20 minutes on my machine.
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    insert into many_students select * from many_students;
    commit;
    
    --The table has about 1.375GB of data.  The actual size will vary.
    select bytes/1024/1024/1024 gb from dba_segments where segment_name = 'MANY_STUDENTS';
    
  2. Gather statistics. Always gather statistics after large table changes. The optimizer cannot do its job well unless it has table, column, and index statistics.

    begin
        dbms_stats.gather_table_stats(user, 'MANY_STUDENTS');
    end;
    /
    
  3. Use hints to force a good and bad plan. Optimizer hints should usually be avoided, but they are helpful for quickly comparing different plans in a test like this.

    For example, this will force a full table scan:

    select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    

    But you'll also want to verify the execution plan:

    explain plan for select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    select * from table(dbms_xplan.display);
    
  4. Flush the cache. Caching is probably the main culprit behind the index and full table scan queries taking the same amount of time. If the table fits entirely in memory then the time to read all the rows may be almost too small to measure. The number could be dwarfed by the time to parse the query or to send a simple result across the network.

    This command will force Oracle to remove almost everything from the buffer cache. This will help you test a "cold" system. (You probably do not want to run this statement on a production system.)

    alter system flush buffer_cache;
    

    However, that won't flush the operating system or SAN cache. And maybe the table really would fit in memory on production. If you need to test a fast query it may be necessary to put it in a PL/SQL loop.

  5. Multiple, alternating runs. There are many things happening in the background, like caching and other processes. It's easy to get bad results because something unrelated changed on the system.

    Maybe the first run takes extra long to put things in a cache. Or maybe some huge job was started between queries. To avoid those issues, alternate running the two queries. Run them five times, throw out the highs and lows, and compare the averages.

    For example, copy and paste the statements below five times and run them. (If using SQL*Plus, run set timing on first.) I already did that and posted the times I got in a comment before each line.

    --Seconds: 0.02, 0.02, 0.03, 0.234, 0.02
    alter system flush buffer_cache;
    select count(*) from many_students M where M.city = '5467 City';
    
    --Seconds: 4.07, 4.21, 4.35, 3.629, 3.54
    alter system flush buffer_cache;
    select /*+ full(M) */ count(*) from many_students M where M.city = '5467 City';
    
  6. Testing is hard. Putting together decent performance tests is difficult. The above rules are only a start.

    This might seem like overkill at first. But it's a complex topic. And I've seen so many people, including myself, waste a lot of time "tuning" something based on a bad test. Better to spend the extra time now and get the right answer.
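The alternating-runs idea in step 5 (run each variant several times, throw out the highs and lows, compare the trimmed averages) is not Oracle-specific. Here is a minimal sketch of that harness using Python's sqlite3 module as a stand-in for Oracle; the table, column, and index names mirror the question, but the timings are illustrative only and the trimming policy (drop one high, one low) is my assumption:

```python
import sqlite3
import statistics
import time

def trimmed_mean(times):
    """Drop the single highest and lowest measurement, average the rest."""
    trimmed = sorted(times)[1:-1]
    return statistics.mean(trimmed)

def time_query(conn, sql, runs=5):
    """Run the query several times and return the trimmed average in seconds."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        elapsed.append(time.perf_counter() - start)
    return trimmed_mean(elapsed)

# SQLite stands in for Oracle here; 200k rows is enough to see the effect.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE many_students (student_id INTEGER, city TEXT)")
conn.executemany("INSERT INTO many_students VALUES (?, ?)",
                 [(i, f"{i % 10000} City") for i in range(200000)])

sql = "SELECT count(*) FROM many_students WHERE city = '5467 City'"
before = time_query(conn, sql)               # full table scan
conn.execute("CREATE INDEX myindex ON many_students (city)")
after = time_query(conn, sql)                # index search
print(f"full scan ~{before:.4f}s, indexed ~{after:.4f}s")
```

In a real test you would also alternate the two queries rather than running all five of one and then five of the other, to spread background noise across both variants.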

Upvotes: 3

eaolson

Reputation: 15090

An index really shines when the database doesn't need to go to every row in a table to get your results. So COUNT(*) isn't the best example. Take this for example:

alter session set statistics_level = 'ALL';
create table mytable as select * from all_objects;
select * from mytable where owner = 'SYS' and object_name = 'DUAL';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

---------------------------------------------------------------------------------------
| Id  | Operation         | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |      1 |        |    300 |00:00:00.01 |      12 |
|   1 |  TABLE ACCESS FULL| MYTABLE |      1 |  19721 |    300 |00:00:00.01 |      12 |
---------------------------------------------------------------------------------------

So, here, the database does a full table scan (TABLE ACCESS FULL), which means it has to visit every row in the table, which means it has to load every block from disk. Lots of I/O. The optimizer guessed that it was going to find 19,721 rows (E-Rows), far more than the 300 it actually returned (A-Rows).

Compare that with this:

create index myindex on mytable( owner, object_name );
select * from mytable where owner = 'SYS' and object_name = 'JOB$';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |         |      1 |        |      1 |00:00:00.01 |       3 |      2 |
|   1 |  TABLE ACCESS BY INDEX ROWID| MYTABLE |      1 |      2 |      1 |00:00:00.01 |       3 |      2 |
|*  2 |   INDEX RANGE SCAN          | MYINDEX |      1 |      1 |      1 |00:00:00.01 |       2 |      2 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OWNER"='SYS' AND "OBJECT_NAME"='JOB$')

Here, because there's an index, the database does an INDEX RANGE SCAN to find the rowids that match our criteria. Then it goes to the table itself (TABLE ACCESS BY INDEX ROWID) and looks up only the rows we need, which it can do efficiently because it has their rowids.

And even better, if you happen to be looking for something that is entirely in the index, the scan doesn't even have to go back to the base table. The index is enough:

select count(*) from mytable where owner = 'SYS';
select * from table( dbms_xplan.display_cursor( null, null, 'ALLSTATS LAST' ));

------------------------------------------------------------------------------------------------
| Id  | Operation         | Name    | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |      1 |        |      1 |00:00:00.01 |      46 |     46 |
|   1 |  SORT AGGREGATE   |         |      1 |      1 |      1 |00:00:00.01 |      46 |     46 |
|*  2 |   INDEX RANGE SCAN| MYINDEX |      1 |   8666 |   9294 |00:00:00.01 |      46 |     46 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("OWNER"='SYS')

Because my query involved the owner column, and that's contained in the index, it never needs to go back to the base table to look anything up. So the index scan is enough, and then it does an aggregation to count the rows. This scenario is a little less than perfect, because the index is on (owner, object_name) and not just owner, but it's definitely better than doing a full table scan on the main table.
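The same full-scan-versus-covering-index switch can be reproduced in a self-contained way with Python's sqlite3 module (standing in for Oracle; the table and index names mirror the answer above, and the data is synthetic). SQLite's EXPLAIN QUERY PLAN plays the role of dbms_xplan here:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE mytable (owner TEXT, object_name TEXT)")
cur.executemany("INSERT INTO mytable VALUES (?, ?)",
                [("SYS" if i % 3 == 0 else "APP", f"OBJ_{i}")
                 for i in range(10000)])

# Without an index, the planner can only scan every row.
plan_before = cur.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM mytable WHERE owner = 'SYS'"
).fetchall()
print(plan_before[0][-1])   # e.g. "SCAN mytable"

# With a composite index whose leading column is the predicate column,
# the query can be answered from the index alone (a covering index).
cur.execute("CREATE INDEX myindex ON mytable (owner, object_name)")
plan_after = cur.execute(
    "EXPLAIN QUERY PLAN SELECT count(*) FROM mytable WHERE owner = 'SYS'"
).fetchall()
print(plan_after[0][-1])    # e.g. "SEARCH mytable USING COVERING INDEX myindex (owner=?)"
```

The wording of the plan lines differs between SQLite versions, but the shift from a scan to an index search is the same shape as the Oracle plans shown above.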

Upvotes: 3
