Tim Tisdall

Reputation: 10382

Can Sphinx handle unicode normalization forms?

I'm aware of the charset_table setting's U+00E9 -> e mapping, which folds 'é' to 'e'. However, if instead of U+00E9 you have U+0065 U+0301 (the "decomposed" form of 'é': an 'e' followed by a combining acute accent), then Sphinx treats the U+0301 as whitespace and breaks up the word.

Example: the two calls below pass visually identical strings, but in the second the final 'é' is in decomposed form (U+0065 U+0301), so Sphinx splits the word at the combining accent:

mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | creme     | creme      | 3    | 3    |
| 2    | brulee    | brulee     | 2    | 2    |
+------+-----------+------------+------+------+
2 rows in set (0.00 sec)

mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | creme     | creme      | 3    | 3    |
| 2    | brule     | brule      | 0    | 0    |
| 3    | e         | e          | 3    | 3    |
+------+-----------+------------+------+------+
3 rows in set (0.15 sec)

Something like NFKC Unicode normalization is needed here, but I can't see any mention of that in the documentation.
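To illustrate the problem, here is a minimal Python sketch (not part of Sphinx; this is application-side) showing that the precomposed and decomposed spellings compare unequal until they are normalized. Normalizing query and document text to NFC before they reach Sphinx is one possible workaround:

```python
import unicodedata

composed = "br\u00fbl\u00e9e"        # 'brûlée' with precomposed û and é
decomposed = "bru\u0302le\u0301e"    # same word, using combining accents

# The strings render identically but are different byte sequences
assert composed != decomposed

# Normalizing to NFC recomposes the combining sequences,
# so both spellings become the same string
assert unicodedata.normalize("NFC", decomposed) == composed
```

With that in place, Sphinx only ever sees the precomposed forms that charset_table already handles.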

Upvotes: 1

Views: 257

Answers (1)

barryhunter

Reputation: 21091

Not sure how to handle it 'scalably' (i.e. all the forms), but individual sequences could probably be handled with regexp_filter:

http://sphinxsearch.com/docs/current/conf-regexp-filter.html

regexp_filter = \%u0065\%u0301 => e

Having said that, perhaps just add U+0301 (and other 'combining' chars) to ignore_chars? http://sphinxsearch.com/docs/current/conf-ignore-chars.html

They disappear, leaving just the 'unaccented' char (e).
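A hedged sketch of that second suggestion, assuming ignore_chars accepts the same U+xxx..U+yyy range syntax as charset_table (the index name and surrounding settings are placeholders, not from the question):

```
index recipes_rt
{
    # hypothetical: drop the whole Combining Diacritical Marks block
    # (U+0300..U+036F) so decomposed accents vanish during tokenization
    ignore_chars = U+300..U+36F
}
```

Note this only removes the accents; the base letters still need the usual charset_table folding so that 'brûlée' and 'brulee' match.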

Upvotes: 1
