John Mellor
John Mellor

Reputation: 2503

MySQL full text search for Scandinavian Characters ø æ

This is the only SO q/a I can find on this subject but it doesn't seem to answer the question: Fulltext search with nordic letters (æ, ø, å)

In MySQL I have a user called "John Nørbæk".

I would like to be able to match this user when searching for:

"Noerbaek", "Nørbaek" or "Noerbæk"

(It would be nice to be able to match for "Norbaek" as well, but my Googling has suggested that ø is equivalent to "oe" rather than "o" so it now feels wrong to request that).

I am using utf8mb4_unicode_ci

I have created an SQL Fiddle here: http://sqlfiddle.com/#!9/b5be3d4/3

Every result i've come up with online seems to be related to PHP and replacing those characters with a str_replace, but if what is in my database is "Nørbæk" then that's not that helpful.

I can't do something like this because the matched string needs to be in the full text index:

SELECT * FROM users WHERE MATCH (REPLACE(`name`, 'æ', 'ae'))
AGAINST ('Nørbaek' IN BOOLEAN MODE);

This would work, but would of course fail when searching for something like "Michael" because it would be converted to Michæl so that's out of the question.

SELECT * FROM users WHERE MATCH (`name`)
AGAINST (REPLACE('Nørbaek', 'ae', 'æ') IN BOOLEAN MODE);

So i'm all out of ideas, there's got to be a better way to match these characters right?

I am more interested in usability than anything else.

As an example if an elderly user wants to find their "Dr Nørbæk" they need to be able to type something in that will find him and not simply be told to learn how to type an "ø".

Edit 1: One comment has suggested that I should use utf8mb4_swedish_ci but I have names from all kinds of languages in my database, including Chinese, so I figure I should be using utf8mb4_unicode_ci although i'll be honest in saying that I simply always use utf8mb4_unicode_ci as a "best practice" without fully understanding the reasons why.

Edit 2: A comment suggested utf8_unicode_520_ci and that seems to work, as seen in this fiddle: http://sqlfiddle.com/#!9/6a604e/4 I will need to do some more tests to see what other affects it has.

Upvotes: 1

Views: 1478

Answers (1)

Rick James
Rick James

Reputation: 142278

As John demonstrated,

MATCH(col) AGAINST('Nørbæk' ...)

succeeds for

  • IN BOOLEAN MODE and IN NATURAL LANGUAGE MODE.
  • col = 'Norbaek',
  • other variants of Nørbæk in the AGAINST,
  • 5.6 (and presumaby later versions),
  • with collation utf8_unicode_520_ci (or utf8mb4_unicode_520_ci).

http://mysql.rjweb.org/utf8_collations.html and http://mysql.rjweb.org/utf8mb4_collations.html show the following equivalence classes for the 520 collation for either character set:

A=a=ª=À=Á=Â=Ã=Ä=Å=à=á=â=ã=ä=å=Ā=ā=Ă=ă=Ą=ą
ae=Æ=æ
O=o=º=Ò=Ó=Ô=Õ=Ö=Ø=ò=ó=ô=õ=ö=ø=Ō=ō=Ŏ=ŏ=Ő=ő=Ơ=ơ
oe=Œ=œ 

So, given that collation, there is no need to do the REPLACEs; instead specify that collation (utf8_unicode_520_ci or utf8mb4_unicode_520_ci) for the column

Upvotes: 2

Related Questions