zsfzu0
zsfzu0

Reputation: 156

Simple MySQL query with ORDER BY too slow

Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b' won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.


I have the following MySQL query to achieve whole word matching.

SELECT  source, target
    FROM  tm
    WHERE  source REGEXP '\\bword\\b'
      AND  customer = 'COMPANY X'
      AND  language = 'YYY'
    ORDER BY  CHAR_LENGTH(source)
    LIMIT  5;

There are 2 customers and 2 languages currently.

My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b' makes sure source has word already.

The tm table:

CREATE TABLE tm(
    id INT AUTO_INCREMENT PRIMARY KEY,
    source TEXT(7000) NOT NULL,
    target TEXT(6000) NOT NULL,
    language CHAR(3),
    customer VARCHAR(10),
    INDEX src_cus_lang (source(755), customer, language)

The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.

The EXPLAIN command showed the below result:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tm
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1117154
     filtered: 1.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

I tried to delelte the src_cus_lang index and created a new one (customer, language, source(755)), but no improvement at all.

I can think of a few solutions:

  1. Recreate the tm table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table.
  2. Create a new column named src_len, i.e. the length of the source. However, ORDER BY src_len is still very slow.
  3. Split the tm table into 4 separate ones by customer and language. Not ideal for me.
  4. Index the source column. Still very slow.
  5. Use INDEX(customer, language). Took 1.4 seconds longer for both normal words and special words like '.vnet'.

Is there a way to cut the execution time down to less than 0.5 seconds?

Upvotes: 1

Views: 158

Answers (1)

Rick James
Rick James

Reputation: 142278

This is essentially useless:

INDEX src_cus_lang (source(755), customer, language)

The prefixing keeps the rest of the columns from being very useful. REGEXP requires checking all 1.1M rows.

This would be better:

INDEX(customer, language)

It will at least filter on those two columns, then apply the REGEXP fewer times.

Since it usually wants to finish with the WHERE before considering the ORDER BY, your attempts at src_len, etc, did not help.

If there are only 4 different combinations of customer and language, not much can be done.

However, you should consider a FULLTEXT(source) index. With such,

MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...

will work much faster.

Also try IN NATURAL LANGUAGE MODE.

Upvotes: 2

Related Questions