fhucho
fhucho

Reputation: 34560

Is full table scan needed for counting rows with some attribute bigger than x?

Let's say there a table of people, with an age column which is indexed. How fast would be a query to count people older than 20: SELECT COUNT(*) FROM people WHERE age > 20? Is full table scan required? The database is MySQL.

Upvotes: 1

Views: 1340

Answers (2)

Rick James
Rick James

Reputation: 142443

There are some errors in the Accepted Answer. Rather than dissect that Answer, I will start fresh:

Given SELECT COUNT(*) FROM people WHERE age > 20, here is the performance for InnoDB, fastest first:

1. `INDEX(age)`            -- Range scan within the index
2. `INDEX(age, ...)`       -- Range scan within the index
3. `INDEX(foo, age)`       -- Full Index scan
4. `PRIMARY KEY(age, ...)` -- Range scan within the table
5. No indexes              -- Table scan needed
6. `PRIMARY KEY(foo, ...)  -- Table scan needed (same as "No index")

Notes and caveats:

  • INDEX(age, ...) is a littler slower than INDEX(age) only because the index is bulkier.
  • Any secondary index containing all the columns mentioned anywhere in the SELECT (just age, in this example) is called a "covering" index. EXPLAIN will say Using index (not to be confused with Using index condition). A covering index is faster than other secondary indexes. (If we had another column in the select, I could say more.)
  • Note "Range scan" vs "scan" -- This is where the processing can drill down the BTree index (primary or secondary) to where age = 20 and scan forward. That is, it does not need to scan the entire table or index, hence Range scan" is faster than "scan".
  • Items 3 and 4 may not be in the correct order. Item 3 may be faster when the index is significantly less bulky than the table. Item 4 may be faster when the range is a small fraction of the table. Because of this "maybe", I can't say "a covering index is always faster than using the PRIMARY KEY". Instead I can only say "usually faster".
  • A million rows is likely to have only 3 levels of BTree. However this part of the computation is almost never worth pursuing. (Rule of Thumb: each level of an index or table BTree fans out by a factor of 100.)
  • If the necessary part of the data or index is not already cached in RAM, then there will be I/O -- this can drastically slow down any of the cases. It can even turn the fastest case into slower than all the rest.
  • If the the data/index is too big to be cached, then there will always be I/O. In this case the ordering will stay roughly the same, but the differences will be more pronounced. (For example, "bulkier" becomes a significant factor.)
  • SELECT name FROM t WHERE age>20 is a different can of worms. Some of what I have said does not carry over to it. (Ask another Question if you want me to spell that out. It will have more cases.)
  • MyISAM and MEMORY have differences relative to InnoDB.

Upvotes: 1

Charles Bretana
Charles Bretana

Reputation: 146557

if the column age is not indexed, then yes, a full table scan is required. Even if it is indexed, if the data distribution of age values is such that there are more than a certain threshold percentage of the records that have age > 20, then a table scan is required anyway. it works this way, for each row that would be returned by the query, the processor must execute n disk IO operations, where n is the number of levels in the index... If there are, say a million rows in the table, and the index on age is 5 levels deep, then if there are more than 200k rows with age value > 20 then for each of those rows the processor has to execute 5 I/Os, for a total of 200k * 5 = 1 million I/Os, so, the optimizer says, if my statistics indicate that more than 200k rows would be returned, I might as well do a complete table scan, that will require less than 1 Million I/Os.

The only exception to this is if the entire table is clustered on the age column, then you only need to traverse the index for the boundaries of the age range you want to filter on.

Upvotes: 1

Related Questions