Reputation: 10382
I'm aware of the charset_table
setting with a rule like U+00E9 -> e,
which will map 'é' to 'e'. However, if instead of U+00E9 you have U+0065 U+0301 (the "decomposed" form of 'é': an 'e' followed by a combining acute accent), then Sphinx treats the U+0301 as whitespace and breaks up the word.
Example, first with the precomposed form (U+00E9, U+00FB):
mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
|    1 | creme     | creme      |    3 |    3 |
|    2 | brulee    | brulee     |    2 |    2 |
+------+-----------+------------+------+------+
2 rows in set (0.00 sec)
Now the same string in decomposed form (U+0065 U+0301 etc.):
mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
|    1 | creme     | creme      |    3 |    3 |
|    2 | brule     | brule      |    0 |    0 |
|    3 | e         | e          |    3 |    3 |
+------+-----------+------------+------+------+
3 rows in set (0.15 sec)
Something like NFKC Unicode normalization is needed here, but I can't see any mention of that in the documentation.
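For reference, this normalization can be applied to text before it ever reaches Sphinx; a minimal Python sketch using the standard-library unicodedata module (NFC is enough to recompose combining accents; NFKC additionally folds compatibility characters):

```python
import unicodedata

# "brûlée" in decomposed form: 'u' + combining circumflex (U+0302),
# 'e' + combining acute (U+0301) -- 8 code points in total
decomposed = "bru\u0302le\u0301e"

# NFC recomposes each base letter + combining mark into a single
# precomposed code point, so charset_table rules like U+00E9 -> e apply
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed))  # 8 code points
print(len(composed))    # 6 code points: û and é are now single characters
print(composed == "br\u00fbl\u00e9e")  # True
```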
Upvotes: 1
Views: 257
Reputation: 21091
Not sure how to handle it 'scalably' (i.e. all the forms), but individual sequences could probably be handled with regexp_filter:
http://sphinxsearch.com/docs/current/conf-regexp-filter.html
regexp_filter = \%u0065\%u0301 => e
Although having said that, perhaps just add U+0301 (and the other 'combining' characters) to ignore_chars:
http://sphinxsearch.com/docs/current/conf-ignore-chars.html
They disappear, leaving just the 'unaccented' char (e).
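As a sketch of that second approach in sphinx.conf (the index name is hypothetical; ignore_chars accepts the same code-point syntax as charset_table, so a range covering the Unicode combining diacritical marks block should work):

```
index recipes_rt
{
    # ... existing source/path/charset_table settings ...

    # Strip combining diacritical marks (U+0300..U+036F) during
    # tokenization, so 'e' + U+0301 indexes as plain 'e' instead of
    # being split into two tokens
    ignore_chars = U+300..U+36F
}
```

This only covers the basic combining-marks block; scripts that use marks from other blocks would need additional ranges.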
Upvotes: 1