dened
dened

Reputation: 4310

Why performance of MySQL queries are so bad when using a CHAR/VARCHAR index?

First, I will describe a simplified version of the problem domain.

There is table strings:

CREATE TABLE strings (
  value CHAR(3) COLLATE utf8_unicode_ci NOT NULL,
  INDEX(value)
) ENGINE=InnoDB;

As you can see, it have a non-unique index of CHAR(3) column.

The table is populated using the following script:

CREATE TABLE a_variants (
  letter CHAR(1) COLLATE utf8_unicode_ci  NOT NULL
) ENGINE=MEMORY;

INSERT INTO a_variants VALUES -- 60 variants of letter 'A'
  ('A'),('a'),('À'),('Á'),('Â'),('Ã'),('Ä'),('Å'),('à'),('á'),('â'),('ã'),
  ('ä'),('å'),('Ā'),('ā'),('Ă'),('ă'),('Ą'),('ą'),('Ǎ'),('ǎ'),('Ǟ'),('ǟ'),
  ('Ǡ'),('ǡ'),('Ǻ'),('ǻ'),('Ȁ'),('ȁ'),('Ȃ'),('ȃ'),('Ȧ'),('ȧ'),('Ḁ'),('ḁ'),
  ('Ạ'),('ạ'),('Ả'),('ả'),('Ấ'),('ấ'),('Ầ'),('ầ'),('Ẩ'),('ẩ'),('Ẫ'),('ẫ'),
  ('Ậ'),('ậ'),('Ắ'),('ắ'),('Ằ'),('ằ'),('Ẳ'),('ẳ'),('Ẵ'),('ẵ'),('Ặ'),('ặ');

INSERT INTO strings
  SELECT CONCAT(a.letter, b.letter, c.letter) -- 60^3 variants of string 'AAA'
    FROM a_variants a, a_variants b, a_variants c
  UNION ALL SELECT 'BBB'; -- one variant of string 'BBB'

So, it contains 216000 indistinguishable (in terms of the utf8_unicode_ci collation) variants of string "AAA" and one variant of string "BBB":

SELECT value, COUNT(*) FROM strings GROUP BY value;
+-------+----------+
| value | COUNT(*) |
+-------+----------+
| AAA   |   216000 |
| BBB   |        1 |
+-------+----------+

As value is indexed, I expect the following two queries to have similar performance:

SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA';
SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB';

But in practice the first one is more than 300x times slower than the second! See:

+----------+------------+---------------------------------------------------------------+
| Query_ID | Duration   | Query                                                         |
+----------+------------+---------------------------------------------------------------+
|        1 | 0.11749275 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
|        2 | 0.00033325 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'BBB' |
|        3 | 0.11718050 | SELECT SQL_NO_CACHE COUNT(*) FROM strings WHERE value = 'AAA' |
+----------+------------+---------------------------------------------------------------+

-- I ran the 'AAA' query twice here just to be sure.

If I change size of the indexed column or change its type to VARCHAR, the problem with performance still manifests itself. Meanwhile, in analogous situations, but when the non-unique index is not CHAR/VARCHAR (e.g. INT), queries are as fast as expected.

So, the question is why performance of MySQL queries are so bad when using a CHAR/VARCHAR index?

I have strong feeling that MySQL perform full linear scan of all the values matched by the index key. But why it do so when it can just return the count of the matched rows? Am I missing something and that is really needed? Or is that a sad shortcoming of MySQL optimizer?

Upvotes: 1

Views: 1231

Answers (2)

Gordon Linoff
Gordon Linoff

Reputation: 1269443

Clearly, the issue is that the query is doing an index scan. The alternative approach would be to do two index lookups, for the first and last values that are the same, and then use meta information in the index for the calculation. Based on your observations, MySQL does both.

The rest of this answer is speculation.

The reason the performance is "only" 300 times slower, rather than 200,000 times slower, is because of overhead in reading the index. Actually scanning the entries is quite fast compared to other operations that are needed.

There is a fundamental difference between numbers and strings when it comes to comparisons. The engine can just look at the bit representations of two numbers and recognize whether they are the same or different. Unfortunately, for strings, you need to take encoding/collation into account. I think that is why it needs to look at the values.

It is possible that if you had 216,000 copies of exactly the same string, then MySQL would be able to do the count using metadata in the index. In other words, the indexer is smart enough to use metadata for exact equality comparisons. But, it is not smart enough to take encoding into account.

Upvotes: 1

TommCatt
TommCatt

Reputation: 5636

One of things you may want to check on is the logical I/O of each query. I'm sure you'll see quite a difference. To count the number of 'BBB's in the table, probably only 3 or 4 LIOs are needed (depending on things like bucket size). To count the number of 'AAA's, essentially the entire table must be scanned, index or not. With 216k rows, that can add up to significantly more LIOs -- not to mention physical I/Os. Logical I/Os are faster than physical I/Os, but any I/O is a performance killer.

As for text vs numbers, it is always easier and faster for software (any software, not just database engines) to compare numbers than text.

Upvotes: 0

Related Questions