Tim Tisdall

Reputation: 10382

Can Sphinx handle unicode normalization forms?

I'm aware of the charset_table setting's U+00E9 -> e mapping, which folds 'é' to 'e'. However, if instead of U+00E9 you have U+0065 U+0301 (the "decomposed" form of 'é': an 'e' followed by a combining acute accent), then Sphinx treats the U+0301 as whitespace and breaks up the word.

Example: the two calls below pass visually identical strings, but in the second the final 'é' is in decomposed form (U+0065 U+0301), so Sphinx splits the word at the combining accent:

mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | creme     | creme      | 3    | 3    |
| 2    | brulee    | brulee     | 2    | 2    |
+------+-----------+------------+------+------+
2 rows in set (0.00 sec)

mysql> CALL KEYWORDS('Crème brûlée', 'recipes_rt', 1);
+------+-----------+------------+------+------+
| qpos | tokenized | normalized | docs | hits |
+------+-----------+------------+------+------+
| 1    | creme     | creme      | 3    | 3    |
| 2    | brule     | brule      | 0    | 0    |
| 3    | e         | e          | 3    | 3    |
+------+-----------+------------+------+------+
3 rows in set (0.15 sec)

Something like NFKC Unicode normalization is needed here, but I can't see any mention of that in the documentation.
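To illustrate the problem, here is a minimal Python sketch (not part of Sphinx; this is application-side) showing that the precomposed and decomposed spellings compare unequal until they are normalized. Normalizing query and document text to NFC before they reach Sphinx is one possible workaround:

```python
import unicodedata

composed = "br\u00fbl\u00e9e"        # 'brûlée' with precomposed û and é
decomposed = "bru\u0302le\u0301e"    # same word, using combining accents

# The strings render identically but are different byte sequences
assert composed != decomposed

# Normalizing to NFC recomposes the combining sequences,
# so both spellings become the same string
assert unicodedata.normalize("NFC", decomposed) == composed
```

With that in place, Sphinx only ever sees the precomposed forms that charset_table already handles.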

Upvotes: 1

Views: 257

Answers (1)

barryhunter

Reputation: 21091

Not sure how to handle it 'scalably' (i.e. all the forms), but individual sequences could probably be handled with regexp_filter:

http://sphinxsearch.com/docs/current/conf-regexp-filter.html

regexp_filter = \%u0065\%u0301 => e

Having said that, perhaps just add U+0301 (and other 'combining' chars) to ignore_chars? http://sphinxsearch.com/docs/current/conf-ignore-chars.html

They disappear, leaving just the 'unaccented' char (e).
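A hedged sketch of that second suggestion, assuming ignore_chars accepts the same U+xxx..U+yyy range syntax as charset_table (the index name and surrounding settings are placeholders, not from the question):

```
index recipes_rt
{
    # hypothetical: drop the whole Combining Diacritical Marks block
    # (U+0300..U+036F) so decomposed accents vanish during tokenization
    ignore_chars = U+300..U+36F
}
```

Note this only removes the accents; the base letters still need the usual charset_table folding so that 'brûlée' and 'brulee' match.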

Upvotes: 1
