CryMasK
CryMasK

Reputation: 303

MySQL - How to search a single character with fulltext index

We have an InnoDB table post, and create a FULLTEXT index on two column title & content with ngram parser.
Most of the data is Chinese characters, but contains some Japanese and English characters.

We use MySQL 8.0.15 and I already set the innodb_ft_min_token_size value to 1. Now I want to do something like this:

SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('e');

expected result:
title          content
------------------------
Food           noodle

or real-case example:

SELECT * FROM `post` WHERE MATCH (`title`, `content`) AGAINST ('麵');

expected result:
title           content
------------------------
本週推薦美食     到底該吃飯還是麵

It doesn't make sense in English, but sometime a single character has meaning in Chinese.
However, when I search a single character, it returns me an empty set. Only for some special characters, like Japanese character and , return a few results in boolean mode (it doesn't return all rows that contain the character).

I tried to use wildcard operator and query in boolean mode, but it is not working when the searching character is the last character of a sentence.

I also tried to set ngram_token_size to 1. It presents a strange result: all general search queries return nothing and those special characters still work as mentioned above. But if I search those special characters with WITH QUERY EXPANSION syntax, this time it will return more results than boolean mode (I'm not sure if these are all rows that contain the character).

Is it possible to use full-text search with a single character?

Upvotes: 1

Views: 3789

Answers (2)

CryMasK
CryMasK

Reputation: 303

I'm not sure if it is a special case for CJK language or ngram parser. It seems that create/rebuild the full-text index in a specific order will cause certain problems.

It is all well if I build the full-text index when creating a table, and insert data afterwards.

Nevertheless, in my case, the table already existed some data. Then I use ALTER TABLE post ADD FULLTEXT ft_search_index(title, CONTENT) WITH PARSER ngram; to build the index.
(There is an another issue if I follow this sequence to build the index, some searching queries can't match all expected results in boolean mode)


Back to the question...

Yes, it is possible to use full-text search with a single character. You should set ngram_token_size to 1.

However, there are some problems when rebuilding the index.

According to the document, after changing the setting, I can use ALTER TABLE to rebuild the index.
But no matter what I use this syntax

ALTER TABLE post DROP INDEX ft_search_index, ADD FULLTEXT ft_search_index(title, content) WITH parser ngram;

(It caused the situation what I mentioned in the question)

or split it into two sentences

ALTER TABLE post DROP INDEX ft_search_index;
ALTER TABLE post ADD FULLTEXT ft_search_index(title, content) WITH parser ngram;

(After executing these two sentences, I can use a full-text query with a single character, but with more characters, it is still possible not to match all expected results in boolean mode as stated above),

both of them can not work well.


Here is my solution:

After that, I try to ran the OPTIMIZE syntax to rebuild the index (As mentioned in the document, this will also rebuild the index).

OPTIMIZE TABLE post;

It is working without those problems.

(Check innodb_optimize_fulltext_only and innodb_ft_num_word_optimize before running OPTIMIZE TABLE for large table)

Upvotes: 1

Solarflare
Solarflare

Reputation: 11106

There is actually only a limited number of problems that could cause this behaviour for an ngram-parser:

  • ngram_token_size has to be set to 1 (in the configuration file or as a startup parameter, it is no dynamic variable), and requires a server restart. Verify with SHOW VARIABLES LIKE 'ngram_token_size' that the setting is active. innodb_ft_min_token_size (and ft_min_word_len for MyISAM) are not used for ngrams, but are oftentimes the reason that very short words cannot be found in "normal" setups, so I'll mention them here for completeness.
  • The fulltext index has to be rebuilt after the server restart. Verify that you are actually using with parser ngram.
  • Natural language fulltext searches on MyISAM-tables have a 50% threshold. This means that terms that occur in at least 50% of rows are ignored. For example, any search for a table with just two rows cannot produce a result
  • The fulltext index for InnoDB tables is only updated after a commit. Verify that you don't do the search in the same transaction as the insert and/or check the value for autocommit.
  • If an ngram contains a stopword, it will be ignored. By default, it only contains english words and should not overlap with chinese characters, but verify that noone changed it by checking the innodb_ft_user_stopword_table-variable (or ft_stopword_file for MyISAM).
  • Character set incompatibilies between your client connnection settings and the table column/server settings can produce unexpected (in-)equalities. (Like umlauts or case-(in)sensitivity, but I don't know if those are relevant in chinese). Try ... where content like '%麵'. If this doesn't find your row either, this is almost certainly the problem.

The following test on MySQL 8.0.15 (with ngram_token_size=1) finds the row:

CREATE TABLE `so_ngramtest` (
  `title` varchar(50) DEFAULT NULL,
  `content` varchar(50) DEFAULT NULL,
  FULLTEXT KEY `title` (`title`,`content`) WITH PARSER `ngram`  
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO `so_ngramtest` (`title`, `content`)
VALUES(_utf8mb4'本週推薦美食' COLLATE utf8mb4_0900_ai_ci, 
       _utf8mb4'到底該吃飯還是麵' COLLATE utf8mb4_0900_ai_ci);

COMMIT;

SELECT * FROM `so_ngramtest` 
WHERE MATCH (`title`, `content`) AGAINST (_utf8mb4'麵' COLLATE utf8mb4_0900_ai_ci);

Try to run this on your server. If this produces a result, find the deviation to your code (e.g. by adding table definition/collation settings to your question). If it doesn't, recheck all the settings in the list above (and add this result to your question).

Upvotes: 0

Related Questions