Reputation: 48883
Suppose we have data like this (Oracle syntax, but that's not significant):
create table EVENT (
    UUID raw(16) default sys_guid(), -- not significant for the question
    TYPE number(2,0) not null,
    DATEX date not null,
    AMOUNT number(18,2) -- we apply ops: SUM, COUNT, AVG, STDDEV_POP, MEDIAN, etc.
);
The distinct TYPE values are limited to a human-manageable count (say 20), DATEX covers the last 10 years, and AMOUNT is the field for statistical analysis (like getting a histogram of AMOUNT by month over a selected DATEX period for a given event type). The number of rows is about 2e+6.
As all queries use a TYPE = n and DATEX between DATE 'yyyy-mm-dd' and DATE 'yyyy-mm-dd' restriction, I decided to create an index on these fields:
create index INDEX_EVENT_MAIN on EVENT (TYPE ASC, DATEX ASC);
With a full scan, query performance is about 2x-5x better than with the index.
Another strategy is to split the data by event TYPE across separate tables like EVENT1, EVENT2, ... I use these tables without any indexes at all. In this case query performance on an EVENTn table is 2x-10x better than on the big EVENT table with TYPE = n (both full scans).
I also tried partitioning the EVENT table:
alter table EVENT add partition event_default values (DEFAULT);
alter table EVENT split partition event_default values (2) into (
    partition event2,
    partition event_default);
and query performance with TYPE = 2 becomes the same as with the separate EVENT2 table.
I am not a DBA expert; I'm the kind of developer who builds Web 2.0 corporate sites. So I can run experiments and guess, but I don't understand the black box and can't interpret the results in terms of solid relational/algorithmic theory.
So I have related questions:
1. Do indexes not work for statistic queries (which process a wide range of rows), and is a full scan better?
2. Are indexes only useful for point (non-range, get-by-ID) queries?
3. Is table splitting or partitioning the only way to boost performance for statistic/aggregation queries?
Upvotes: 2
Views: 686
Reputation: 36892
Full table scans work better for reading "large" amounts of data; indexes work better for reading "small" amounts of data. Where the cutoff lies depends mostly on the index clustering factor and on single-block versus multi-block I/O.
Index Clustering Factor
As Ronnis mentioned, the amount of time spent on I/O depends on the number of blocks read from disk. But reading multiple rows from the same block at one time is usually very cheap - once the block is in memory scanning through the rows is fast.
The real issue is that depending on how the data is ordered on disk, reading a small percentage of the rows could require reading a large percentage of the data. Some indexes are inefficient because of how the table data was created.
The index clustering factor is a measure of how ordered the data is. The number is an estimate of the "number of I/Os required to read an entire table by means of an index". That number can be found in DBA_INDEXES.CLUSTERING_FACTOR.
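For example, you can inspect it like this (a minimal sketch, assuming the INDEX_EVENT_MAIN index from the question and access to the DBA views):

select index_name, clustering_factor, num_rows
from dba_indexes
where index_name = 'INDEX_EVENT_MAIN';
-- A clustering_factor near the number of table blocks means well-clustered data;
-- a value near num_rows means the matching rows are scattered across many blocks.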
You can often optimize an index by rebuilding the table with the rows sorted a specific way. But that only works for one index.
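A common way to do that rebuild is CREATE TABLE ... AS SELECT with an ORDER BY matching the index. A sketch, assuming a maintenance window (constraints, grants, and indexes must be re-created afterwards):

create table EVENT_SORTED as
select * from EVENT
order by TYPE, DATEX;
-- then drop EVENT, rename EVENT_SORTED to EVENT, and rebuild the index.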
Single-Block versus Multi-Block I/O
Oracle can read either one block at a time or multiple blocks at a time.
For reading a single value from a single row, obviously the fastest approach is to read as little data as possible. Access methods like index range scans always use single-block I/O, while access methods like full table scans and fast full index scans always use multi-block I/O.
For large amounts of data, reading in large chunks is much more efficient than reading one block at a time. I can't give you a good explanation of how disk heads seek, read data from sectors, etc., but the details are unimportant; there's a generic lesson here about database performance: always process in batches.
You may even be able to figure out the single-block vs. multi-block time on your system. If you have system statistics gathered, and they are accurate, you can use these two queries to figure out the time to read a block one-at-a-time vs. in a batch:
--Blocks read per multi-block read, if you set the value yourself.
select value from v$parameter where name = 'db_file_multiblock_read_count';
--Time to read single and multiple blocks, in milliseconds.
--And average blocks per multi-block read.
select * from sys.aux_stats$ where pname in ('SREADTIM', 'MREADTIM', 'MBRC');
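To see why batching wins, here is some illustrative arithmetic (the timings below are assumptions, not measurements from your system):

-- Suppose SREADTIM = 5 ms, MREADTIM = 10 ms, and MBRC = 16.
-- Reading 10,000 blocks one at a time costs 10,000 * 5 ms = 50 seconds;
-- reading them in multi-block batches costs (10,000 / 16) * 10 ms = 6.25 seconds.
select 10000 * 5 / 1000             as single_block_seconds,
       ceil(10000 / 16) * 10 / 1000 as multi_block_seconds
from dual;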
Upvotes: 1
Reputation: 12843
I'll answer your questions, but first some background needed to understand the answers:
The time taken to perform a full scan is pretty much determined by the throughput of the hard drives on which your data resides. If your disk can deliver 200 mb/s, it will take ~1 second to perform a full table scan of a table with 200 mb of data, regardless of the number of rows.
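As a rough sanity check, you can estimate this from the segment size (a sketch; the 200 mb/s figure is an assumed throughput, and EVENT is the table from the question):

select bytes / 1024 / 1024       as size_mb,
       bytes / 1024 / 1024 / 200 as est_full_scan_seconds
from dba_segments
where segment_name = 'EVENT';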
Imagine a 200 mb table without any indexes, but where a column ID is unique within the data. In this case both of the queries below will take the same time, because the bulk of the time is spent waiting for the hard drives to hand data to the Oracle process.
The first query will take a long time because of all the data Oracle has to wade through in order to find the row which satisfies id = 1. The second query will take a long time because of all the data Oracle has to wade through in order to aggregate the values of one_column and another_column.
select id, one_column, another_column
from two_hundred_mb_table
where id = 1;

select sum(one_column) / sum(another_column)
from two_hundred_mb_table;
If you were to add an index to column ID, everything changes. The first query would now only have to visit the index for ID = 1, pick up the "rowid" (the physical address of the row in the data file), request that "block" from disk, and then pick out the row. The first query is now a lot faster because of all the data it doesn't have to wade through.
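For reference, that index would be something like this (the name is illustrative; since ID is described as unique, a unique index fits):

create unique index two_hundred_mb_table_id_ux
    on two_hundred_mb_table (id);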
The crucial point here is that even though you have indexed the column, you still can't pick the row directly from disk. You still have to pick up the entire block (typically ~8 kb) from disk. With an average row length of, say, 100 bytes, that block holds ~82 rows. So you read 82 rows in order to find your one row.
This is why you typically can't read a lot of rows via an index before it becomes slower than a table scan: you may end up re-reading the same block over and over again. And of course there is a breaking point (which is different in every case) where reading the data via a full table scan becomes faster than via an index.
Now, on to your questions:
1. Do indexes not work for statistic queries (which process a wide range of rows), and is a full scan better? The answer lies in the text above. It has nothing to do with sum/count as such; it has to do with the amount of data in the table and whether there is an efficient access path to the subset of interest.
2. Are indexes used only for point (non-range, get-by-ID) queries, not wide ranges? Again, the answer lies in the text above. You can use range queries on indexes, but if the subset of interest is too large, you're better off with a full table scan.
3. Is table splitting or partitioning the only way to boost query performance for statistic/aggregation queries?
If the table is 2,000 mb and your disk can return 200 mb/s, it will take 10 seconds to perform a full table scan. Assuming uniform data distribution on type, and that you have 10 distinct values for it, you could list-partition the table by type. In this case every partition would be 200 mb, so any query with type = n would take 1 second instead of 10. However, all queries without a type = n predicate would still take 10 seconds.
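A sketch of that list partitioning, using the EVENT table from the question (partition names are illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by list (TYPE) (
    partition event1 values (1),
    partition event2 values (2),
    -- ... one partition per TYPE value ...
    partition event_default values (DEFAULT)
);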
You can also make a range partition on the datex column, for example one partition per month. Again assuming that the table is 2,000 mb with uniform data distribution over the 10 years, you would end up with roughly 1/120 of the data in each monthly partition.
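A sketch of that monthly range partitioning (using interval partitioning, available from Oracle 11g, so new months get partitions automatically; the boundary date is illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by range (DATEX)
interval (numtoyminterval(1, 'MONTH')) (
    partition p_history values less than (DATE '2010-01-01')
);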
You can also make a combination of these, and composite-partition by LIST(type) and RANGE(datex).
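A sketch of that composite partitioning (list-range composite partitioning requires Oracle 11g or later; the subpartition boundaries are illustrative):

create table EVENT (
    UUID   raw(16) default sys_guid(),
    TYPE   number(2,0) not null,
    DATEX  date not null,
    AMOUNT number(18,2)
)
partition by list (TYPE)
subpartition by range (DATEX)
subpartition template (
    subpartition y2013 values less than (DATE '2014-01-01'),
    subpartition y2014 values less than (DATE '2015-01-01'),
    subpartition p_max values less than (MAXVALUE)
) (
    partition event1 values (1),
    partition event2 values (2),
    partition event_default values (DEFAULT)
);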
If you still cannot meet the performance requirements, you can look into creating aggregate tables (or materialized views). For example, if you find yourself doing a lot of analysis over larger timespans, you can aggregate the data by month and run the higher-level queries against the aggregated data, as sketched below. Once you have found a month you need to "drill down" into, you can query the event table again with a predicate that pretty much hits one partition.
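A monthly aggregate could look like this (a sketch; the view name, refresh policy, and chosen aggregates are assumptions):

create materialized view EVENT_BY_MONTH
    build immediate
    refresh complete on demand
as
select TYPE,
       trunc(DATEX, 'MM') as MONTH_START,
       count(*)           as CNT,
       sum(AMOUNT)        as TOTAL_AMOUNT,
       avg(AMOUNT)        as AVG_AMOUNT
from EVENT
group by TYPE, trunc(DATEX, 'MM');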
Upvotes: 3