Reputation: 338

Sorting Japanese Characters (Kanji, Hiragana, Katakana) in MySQL

I'm familiar with the different MySQL collations, my DB is currently set UTF8_general_ci, and I was searching semi-successfully using UTF8_unicode_ci. I'm running mysql 5.6, and I can't use the newest collations, without a headache in upgrading.

I say semi-successfully, as ご is returning こ when using UTF8_unicode_ci; I can live with that if I can solve the big issue below.

What I'm trying to achieve is searching my city column and returning a row, so Yokohama and 横浜 and よこはま and ヨコハマ all return the same row in querying the DB. Is this possible?

Upvotes: 3

Answers (2)

daxim

Reputation: 39158

I used libkakasi in the past, though I collated in application space, not in the database.

› echo -n '横浜' | kakasi -i utf8 -o utf8 -JH
よこはま
› echo -n 'ヨコハマ' | kakasi -i utf8 -o utf8 -KH
よこはま

You could normalise everything to hiragana first, store these strings additionally in the database, perform collation on them with an existing Unicode collation, and when you found a match, fetch the corresponding original unnormalised string instead.

Upvotes: 1

polm23

Reputation: 15593

This is a difficult problem, and I suspect you can't easily solve it just using MySQL.

There's a program called Migemo which deals with this problem to a certain extent in different contexts. It generates a regex to match unconverted input to a document. You can see an online demo of one version with output here. For example, the regex for the input "toukyou" is:

[とトﾄ][うウｳ][きキｷ][ょョｮ][うウｳ]|とうきょう|当協会|東京|東教|ｔｏｕｋｙｏｕ|([tţťŧŢŤŦ]|t[¸ˇ-])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))([kķĸǩĶǨ]|k[¸ˇ])([yỳýÿŷỲÝŸŶ]|y[ˋ`ˊ´¨ˆ^])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))

Since Migemo is for matching from partial non-converted input, it expects ASCII input, so you can't use it as-is. However, the basic strategy of Migemo - convert your input to a regex matching all variants - can be used to search your MySQL database.

As a commenter mentioned this is hard if you want to support place names with idiosyncratic romanization; do you need to recognize both "Shinbashi" and "Shimbashi", for example? Even "Tokyo" requires an exception if you're just working from hiragana input. For this I recommend you build your own list, potentially using romaji data from JP Post. It won't deal with all variations but it will get the standard ones.

Hope that helps.

Upvotes: 1

Sorting Japanese Characters (Kanji, Hiragana, Katakana) in MySQL

Answers (2)

Related Questions