Reputation: 48883
Suppose we have data like this (Oracle syntax, but that's not significant):
create table EVENT (
    UUID raw(16) default sys_guid(), -- not significant for the question
    TYPE number(2,0) not null,
    DATEX date not null,
    AMOUNT number(18,2) -- we apply ops: SUM, COUNT, AVG, STDDEV_POP, MEDIAN, etc.
);
The distinct TYPE values are limited to a human-manageable count (say 20), DATEX covers the last 10 years, and AMOUNT is the field for statistical analysis (like getting a histogram of AMOUNT by month over a selected DATEX period for a given event type). The number of rows is about 2e+6.
As all queries use a TYPE = n and DATEX between DATE 'yyyy-mm-dd' and DATE 'yyyy-mm-dd' restriction, I decided to create an index on these fields:
create index INDEX_EVENT_MAIN on EVENT (TYPE ASC, DATEX ASC);
With a full scan, query performance is about 2x-5x better than with the index.
Another strategy is to split the data by event TYPE across separate tables like EVENT1, EVENT2, ... I use these tables without any indexes at all. In this case query performance on an EVENTn table is 2x-10x better than on the big EVENT table with TYPE = n (both full scans).
I also tried partitioning the EVENT table:
alter table EVENT add partition event_default values (DEFAULT);
alter table EVENT split partition event_default values (2) into (
    partition event2,
    partition event_default);
and query performance with TYPE = 2 becomes the same as with the separate EVENT2 table.
I am not a DBA expert; I'm the kind of developer who builds Web 2.0 corporate sites. So I can run experiments and guess, but I don't understand the black box and can't interpret the results in terms of solid relational/algorithmic theory.
So I have related questions:
1. Do indexes not work for statistic queries (which process a wide range of rows), and is a full scan better?
2. Are indexes only useful for point (non-range, get-by-ID) queries?
3. Is table splitting or partitioning the only way to boost performance for statistic/aggregation queries?
Upvotes: 2
Views: 686
Reputation: 36892
Full table scans work better for reading "large" amounts of data; indexes work better for reading "small" amounts of data. Where the cutoff lies depends mostly on the index clustering factor and on single-block versus multi-block I/O.
Index Clustering Factor
As Ronnis mentioned, the amount of time spent on I/O depends on the number of blocks read from disk. But reading multiple rows from the same block at one time is usually very cheap - once the block is in memory scanning through the rows is fast.
The real issue is that depending on how the data is ordered on disk, reading a small percentage of the rows could require reading a large percentage of the data. Some indexes are inefficient because of how the table data was created.
The index clustering factor is a measure of how ordered the data is. The number is an estimate of the "number of I/Os required to read an entire table by means of an index". That number can be found in DBA_INDEXES.CLUSTERING_FACTOR.
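For example, you can inspect it like this (a minimal sketch, assuming the INDEX_EVENT_MAIN index from the question and access to the DBA views):

select index_name, clustering_factor, num_rows
from dba_indexes
where index_name = 'INDEX_EVENT_MAIN';
-- A clustering_factor near the number of table blocks means well-clustered data;
-- a value near num_rows means the matching rows are scattered across many blocks.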
You can often optimize an index by rebuilding the table with the rows sorted a specific way. But that only works for one index.
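A common way to do that rebuild is CREATE TABLE ... AS SELECT with an ORDER BY matching the index. A sketch, assuming a maintenance window (constraints, grants, and indexes must be re-created afterwards):

create table EVENT_SORTED as
select * from EVENT
order by TYPE, DATEX;
-- then drop EVENT, rename EVENT_SORTED to EVENT, and rebuild the index.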
Single-Block versus Multi-Block I/O
Oracle can read either one block at a time or multiple blocks at a time.
For reading a single value from a single row, obviously the fastest approach is to read as little data as possible. Access methods like index range scans always use single-block I/O, while access methods like full table scans and fast full index scans always use multi-block I/O.
For large amounts of data, reading in large chunks is much more efficient than reading one block at a time. I can't give you a good explanation of how disk heads seek, read data from sectors, etc., but the details are unimportant; there's a generic lesson here about database performance: always process in batches.
You may even be able to figure out the single-block vs. multi-block time on your system. If you have system statistics gathered, and they are accurate, you can use these two queries to figure out the time to read a block one-at-a-time vs. in a batch:
--Blocks read per multi-block read, if you set the value yourself.
select value from v$parameter where name = 'db_file_multiblock_read_count';
--Time to read single and multiple blocks, in milliseconds.
--And average blocks per multi-block read.
select * from sys.aux_stats$ where pname in ('SREADTIM', 'MREADTIM', 'MBRC');
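To see why batching wins, here is some illustrative arithmetic (the timings below are assumptions, not measurements from your system):

-- Suppose SREADTIM = 5 ms, MREADTIM = 10 ms, and MBRC = 16.
-- Reading 10,000 blocks one at a time costs 10,000 * 5 ms = 50 seconds;
-- reading them in multi-block batches costs (10,000 / 16) * 10 ms = 6.25 seconds.
select 10000 * 5 / 1000             as single_block_seconds,
       ceil(10000 / 16) * 10 / 1000 as multi_block_seconds
from dual;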
Upvotes: 1
Reputation: 12843
I'll answer your questions, but first some background needed to understand the answers:
The time taken to perform a full scan is pretty much determined by the throughput of the hard drives on which your data resides. If your disk can deliver 200 mb/s, it will take ~1 second to perform a full table scan of a table with 200 mb of data, regardless of the number of rows.
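As a rough sanity check, you can estimate this from the segment size (a sketch; the 200 mb/s figure is an assumed throughput, and EVENT is the table from the question):

select bytes / 1024 / 1024       as size_mb,
       bytes / 1024 / 1024 / 200 as est_full_scan_seconds
from dba_segments
where segment_name = 'EVENT';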
Imagine a 200 mb table without any indexes, but where a column ID is unique within the data. In this case both of the queries below will take the same time, because the bulk of the time is spent waiting for the hard drives to hand data to the Oracle process.
The first query will take a long time because of all the data Oracle has to wade through in order to find the row which satisfies id = 1. The second query will take a long time because of all the data Oracle has to wade through in order to aggregate the values of one_column and another_column.
select id, one_column, another_column
from two_hundred_mb_table
where id = 1;

select sum(one_column) / sum(another_column)
from two_hundred_mb_table;
If you were to add an index to column ID, everything changes. The first query would now only have to visit the index for ID = 1, pick up the "rowid" (the physical address of the row in the data file), request that "block" from disk, and then pick out the row. The first query is now a lot faster because of all the data it doesn't have to wade through.
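For reference, that index would be something like this (the name is illustrative; since ID is described as unique, a unique index fits):

create unique index two_hundred_mb_table_id_ux
    on two_hundred_mb_table (id);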
The crucial point here is that even though you have indexed the column, you still can't pick the row directly from disk. You still have to pick up the entire block (typically ~8 kb) from disk. With an average row length of, say, 100 bytes, that block holds ~82 rows. So you read 82 rows in order to find your one row.
This is why you typically can't read a lot of rows via an index before it becomes slower than a table scan: you may end up re-reading the same block over and over again. And of course there is a breaking point (which is different in every case) where reading the data via a full table scan becomes faster than via an index.
Now, on to your questions:
1. Do indexes not work for statistic queries (which process a wide range of rows), and is a full scan better? The answer lies in the text above. It has nothing to do with sum/count as such; it has to do with the amount of data in the table and whether there is an efficient access path to the subset of interest.
2. Are indexes used only for point (non-range, get-by-ID) queries, not wide ranges? Again, the answer lies in the text above. You can use range queries on indexes, but if the subset of interest is too large, you're better off with a full table scan.
3. Is table splitting or partitioning the only way to boost query performance for statistic/aggregation queries?
If the table is 2,000 mb and your disk can return 200 mb/s, it will take 10 seconds to perform a full table scan. Assuming uniform data distribution on type, and that you have 10 distinct values for it, you could list-partition the table by type. In this case every partition would be 200 mb, so any query with type = n would take 1 second instead of 10. However, all queries without a type = n predicate would still take 10 seconds.
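A sketch of that list partitioning, using the EVENT table from the question (partition names are illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by list (TYPE) (
    partition event1 values (1),
    partition event2 values (2),
    -- ... one partition per TYPE value ...
    partition event_default values (DEFAULT)
);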
You can also make a range partition on the datex column, for example one partition per month. Again assuming that the table is 2,000 mb with uniform data distribution over the 10 years, you would end up with roughly 1/120 of the data in each monthly partition.
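A sketch of that monthly range partitioning (using interval partitioning, available from Oracle 11g, so new months get partitions automatically; the boundary date is illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by range (DATEX)
interval (numtoyminterval(1, 'MONTH')) (
    partition p_history values less than (DATE '2010-01-01')
);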
You can also make a combination of these, and composite-partition by LIST(type) and RANGE(datex).
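A sketch of that composite partitioning (list-range composite partitioning requires Oracle 11g or later; the subpartition boundaries are illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by list (TYPE)
subpartition by range (DATEX)
subpartition template (
    subpartition y2013 values less than (DATE '2014-01-01'),
    subpartition y2014 values less than (DATE '2015-01-01'),
    subpartition p_max values less than (MAXVALUE)
) (
    partition event1 values (1),
    partition event2 values (2),
    partition event_default values (DEFAULT)
);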
If you still cannot meet the performance requirements, you can look into creating aggregate tables (or materialized views). For example, if you find yourself doing a lot of analysis over larger timespans, you can aggregate the data by month and run the higher-level queries against the aggregated data, as sketched below. Once you have found a month you need to "drill down" into, you can query the event table again with a predicate that pretty much hits one partition.
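A monthly aggregate could look like this (a sketch; the view name, refresh policy, and chosen aggregates are assumptions):

create materialized view EVENT_BY_MONTH
    build immediate
    refresh complete on demand
as
select TYPE,
       trunc(DATEX, 'MM') as MONTH_START,
       count(*)           as CNT,
       sum(AMOUNT)        as TOTAL_AMOUNT,
       avg(AMOUNT)        as AVG_AMOUNT
from EVENT
group by TYPE, trunc(DATEX, 'MM');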
Upvotes: 3