burger
burger

Reputation: 5883

Why does greater-than versus equals make a difference in MySQL SELECT?

I have a large MyISAM table. It's approaching 1 million rows. It's basically a list of items and some information about them.

There are two indices:

I run two queries:

SELECT * FROM table WHERE date = '2011-02-01' AND col < 5 LIMIT 10

SELECT * FROM table WHERE date < '2011-02-01' AND col < 5 LIMIT 10

The first one finishes in ~0.0005 seconds and the second in ~0.05 seconds. That is 100X difference. Is it wrong for me to expect both of these to run at roughly the same speed? I must not be understanding the indices very well. How can I speed up the second query?

Upvotes: 8

Views: 3725

Answers (4)

James
James

Reputation: 3892

MySQL stores its indexes by default in a BTREE. No hashing in general.

The short answer for the performance difference is that the < form evaluates more nodes then the = form.

The index that you've got on there (date, col) stores the values roughly like a phone book:

2011-01-01, col=1, row_ptr
2011-01-01, col=2, row_ptr
2011-01-01, col=3, row_ptr
etc...
2011-02-01, col=1, row_ptr
2011-02-01, col=2, row_ptr
2011-02-01, col=3, row_ptr
etc...
2011-02-02, col=1, row_ptr
2011-02-02, col=2, row_ptr
etc...

...in ascending sorted tree nodes of size B (2011-01-01, col=1) < (2011-01-01, col=2) < (2011-01-02, col=1).

Your question is essentially asking the difference between:

  1. Find all phone numbers with last name 'Smith' and first name starting with 'A'
  2. Find all phone numbers that come before 'Smith' and have first name starting with 'A'.

It should be obvious why #1 is so much faster then #2.

There are also considerations of memory /disk transfer efficiency and heap allocations (= does WAY fewer transfers then <) that account for a not-insignificant amount of time but depend largely on the distribution of the data and the specific location of the 2011-02-01, col=min(col) key record.

Upvotes: 2

Adam Gent
Adam Gent

Reputation: 49085

Regardless of Mysql it boils down to basic algorithm theory.

Greater than and Less than operations on a large set are slower than Identity operations. With a large data set an ideal data structure for determining less than or greater is a self balancing tree (binary or n-tree). On a a self balanced tree the worst case scenario to find all less/greater is log n.

The ideal data structure for identity lookup is a hashtable. The performance of hashtables is generally O(1) aka fixed time. A hashtable however is not good for greater/less.

Generally a well balanced tree is only slightly less performing than a hashtable (which is how Haskell gets away with using a tree for hashtables).

Thus irregardless of what Mysql does its not surprise that <,> is slower than =

Old Answer below:

Because the first one is like Hashtable lookup since its '=' (particularly if your index is a hashtable) it will be faster than the second one which might work better with a tree like index.

Since MySql allows to configure the index format you can try changing that but I'm rather sure the first will always run faster than the second.

Upvotes: 5

user183037
user183037

Reputation: 2579

I'm assuming you have an index on the date column. The first query uses the index, the second query probably does a linear scan (at least over part of the data). A direct fetch is always faster than a linear scan.

Upvotes: 2

Tushar
Tushar

Reputation: 21

The first one performs a seek over data where as the second one goes for a scan . Scans are always costlier than seeks hence the time difference .

Its like that, the the scan means running through all the pages of the book where as seek is directly jumping to a page number.

Hope this might help.

Upvotes: 1

Related Questions