Reputation: 156
Edit:The time spent for querying a normal word is actually 1.78 seconds. The 4.5 seconds mentioned in the original post below was when querying special words like '.vnet'. (I know REGEXP '\\b.vnet\\b'
won't find the whole word match for '.vnet'. I might use a more complex regex to fix this later, or drop the support for '.vnet' if it's too time-consuming.) Also I added solution 5 below.
I have the following MySQL query to achieve whole word matching.
SELECT source, target
FROM tm
WHERE source REGEXP '\\bword\\b'
AND customer = 'COMPANY X'
AND language = 'YYY'
ORDER BY CHAR_LENGTH(source)
LIMIT 5;
There are 2 customers and 2 languages currently.
My goal is to find the top 5 closest matches of a phrase among hundreds of thousands of English sentences. The reason the fetched records are ordered by CHAR_LENGTH
is because the shorter the length, the higher the match ratio, since REGEXP '\\bword\\b'
makes sure source
has word
already.
The tm table:
CREATE TABLE tm(
id INT AUTO_INCREMENT PRIMARY KEY,
source TEXT(7000) NOT NULL,
target TEXT(6000) NOT NULL,
language CHAR(3),
customer VARCHAR(10),
INDEX src_cus_lang (source(755), customer, language)
The query above took about 4.5 seconds to finish, which is very slow for me and my PC that has an Intel Core i5-10400F, 16GB RAM and an SSD.
The EXPLAIN command showed the below result:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: tm
partitions: NULL
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 1117154
filtered: 1.00
Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)
I tried to delelte the src_cus_lang
index and created a new one (customer, language, source(755))
, but no improvement at all.
I can think of a few solutions:
tm
table, ordering by CHAR_LENGTH(source) in the process. This is not ideal for me as I'd like to keep the original order of the table.src_len
, i.e. the length of the source. However, ORDER BY src_len
is still very slow.tm
table into 4 separate ones by customer
and language
. Not ideal for me.source
column. Still very slow.INDEX(customer, language)
. Took 1.4 seconds longer for both normal words and special words like '.vnet'.Is there a way to cut the execution time down to less than 0.5 seconds?
Upvotes: 1
Views: 158
Reputation: 142278
This is essentially useless:
INDEX src_cus_lang (source(755), customer, language)
The prefixing keeps the rest of the columns from being very useful. REGEXP
requires checking all 1.1M rows.
This would be better:
INDEX(customer, language)
It will at least filter on those two columns, then apply the REGEXP
fewer times.
Since it usually wants to finish with the WHERE
before considering the ORDER BY
, your attempts at src_len, etc, did not help.
If there are only 4 different combinations of customer
and language
, not much can be done.
However, you should consider a FULLTEXT(source)
index. With such,
MATCH(source) AGAINST('+word' IN BOOLEAN MODE)
AND ...
will work much faster.
Also try IN NATURAL LANGUAGE MODE
.
Upvotes: 2