Reputation: 303
We have an InnoDB table post and created a FULLTEXT index with the ngram parser on two columns, title and content.
Most of the data is Chinese, but it also contains some Japanese and English characters.
We use MySQL 8.0.15, and I have already set the innodb_ft_min_token_size value to 1. Now I want to do something like this:
SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('e');
expected result:

title   content
-----   -------
Food    noodle
Or a real example from our data:
SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('麵');
expected result:

title          content
------------   ----------------
本週推薦美食   到底該吃飯還是麵
It doesn't make sense in English, but in Chinese a single character often has a meaning of its own.
However, when I search for a single character, I get an empty set. Only a few special characters, such as the Japanese characters す and し, return some results in boolean mode (and even then not all rows that contain the character are returned).
I tried the wildcard operator in a boolean-mode query, but it does not work when the character I am searching for is the last character of a sentence.
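For illustration, the boolean-mode wildcard query I mean is of roughly this shape (the search term here is only an example; the trailing * is the boolean-mode truncation operator):
SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('麵*' IN BOOLEAN MODE);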
I also tried setting ngram_token_size to 1. That produces a strange result: all ordinary search queries return nothing, while those special characters still behave as described above. However, if I search for those special characters with the WITH QUERY EXPANSION syntax, more results are returned than in boolean mode (I'm not sure whether these are all the rows that contain the character).
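Again for illustration, the query-expansion variant looks roughly like this (with す only as an example character):
SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('す' WITH QUERY EXPANSION);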
Is it possible to use full-text search with a single character?
Upvotes: 1
Views: 3789
Reputation: 303
I'm not sure whether this is specific to CJK languages or to the ngram parser, but it seems that creating/rebuilding the full-text index in a certain order causes problems.
Everything works fine if I build the full-text index when creating the table and insert the data afterwards.
However, in my case the table already contained data, and I then built the index with:
ALTER TABLE post ADD FULLTEXT ft_search_index(title, content) WITH PARSER ngram;
(There is another issue if I follow this sequence to build the index: some search queries can't match all expected results in boolean mode.)
Yes, it is possible to use full-text search with a single character. You should set ngram_token_size to 1.
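Since ngram_token_size is not a dynamic variable, it has to be set in the configuration file (or as a startup option) and the server must be restarted; a minimal sketch of the relevant my.cnf section:
[mysqld]
ngram_token_size=1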
However, there are some problems when rebuilding the index.
According to the documentation, after changing the setting I can use ALTER TABLE to rebuild the index. But neither the single statement
ALTER TABLE post DROP INDEX ft_search_index, ADD FULLTEXT ft_search_index(title, content) WITH PARSER ngram;
(which caused the situation I described in the question)
nor splitting it into two statements
ALTER TABLE post DROP INDEX ft_search_index;
ALTER TABLE post ADD FULLTEXT ft_search_index(title, content) WITH PARSER ngram;
(after executing these two statements I can run a full-text query with a single character, but queries with more characters may still not match all expected results in boolean mode, as stated above)
works correctly.
After that, I tried running OPTIMIZE TABLE to rebuild the index (as mentioned in the documentation, this also rebuilds the full-text index):
OPTIMIZE TABLE post;
This works without any of those problems.
(Check innodb_optimize_fulltext_only and innodb_ft_num_word_optimize before running OPTIMIZE TABLE on a large table; see the sketch below.)
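A sketch of what that sequence could look like (the value 10000 is just an example; innodb_ft_num_word_optimize caps how many words a single OPTIMIZE TABLE run processes, so the statement may have to be repeated for large tables):
-- rebuild only the full-text index instead of the whole table
SET GLOBAL innodb_optimize_fulltext_only = ON;
-- words processed per OPTIMIZE TABLE run (example value)
SET GLOBAL innodb_ft_num_word_optimize = 10000;
-- repeat until all words have been processed
OPTIMIZE TABLE post;
-- restore the default behaviour afterwards
SET GLOBAL innodb_optimize_fulltext_only = OFF;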
Upvotes: 1
Reputation: 11106
There are actually only a limited number of problems that could cause this behaviour with an ngram parser:

- ngram_token_size has to be set to 1 (in the configuration file or as a startup parameter; it is not a dynamic variable) and requires a server restart. Verify with SHOW VARIABLES LIKE 'ngram_token_size' that the setting is active. innodb_ft_min_token_size (and ft_min_word_len for MyISAM) are not used for ngrams, but are oftentimes the reason that very short words cannot be found in "normal" setups, so I'll mention them here for completeness.
- The full-text index has to be created with parser ngram.
- MyISAM tables have a 50% threshold: terms that occur in at least 50% of rows are ignored. For example, any search on a table with just two rows cannot produce a result.
- Rows from a transaction that has not been committed yet are not visible to a full-text search; check that the data was committed (autocommit).
- The searched term could be a stopword; check the innodb_ft_user_stopword_table variable (or ft_stopword_file for MyISAM).
- The row may simply not contain the term; verify with a plain ... where content like '%麵'. If this doesn't find your row either, this is almost certainly the problem.

The following test on MySQL 8.0.15 (with ngram_token_size=1) finds the row:
CREATE TABLE `so_ngramtest` (
`title` varchar(50) DEFAULT NULL,
`content` varchar(50) DEFAULT NULL,
FULLTEXT KEY `title` (`title`,`content`) WITH PARSER `ngram`
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;
INSERT INTO `so_ngramtest` (`title`, `content`)
VALUES(_utf8mb4'本週推薦美食' COLLATE utf8mb4_0900_ai_ci,
_utf8mb4'到底該吃飯還是麵' COLLATE utf8mb4_0900_ai_ci);
COMMIT;
SELECT * FROM `so_ngramtest`
WHERE MATCH (`title`, `content`) AGAINST (_utf8mb4'麵' COLLATE utf8mb4_0900_ai_ci);
Try running this on your server. If it produces a result, look for where your setup deviates from it (e.g. by adding your table definition/collation settings to your question). If it doesn't, recheck all the settings in the list above (and add that result to your question).
Upvotes: 0