Reputation: 338
I'm familiar with the different MySQL collations, my DB is currently set UTF8_general_ci
, and I was searching semi-successfully using UTF8_unicode_ci
. I'm running mysql 5.6, and I can't use the newest collations, without a headache in upgrading.
I say semi-successfully, as ご
is returning こ
when using UTF8_unicode_ci
; I can live with that if I can solve the big issue below.
What I'm trying to achieve is searching my city column and returning a row, so Yokohama
and 横浜
and よこはま
and ヨコハマ
all return the same row in querying the DB. Is this possible?
Upvotes: 3
Views: 2278
Reputation: 39158
I used libkakasi in the past, though I collated in application space, not in the database.
› echo -n '横浜' | kakasi -i utf8 -o utf8 -JH
よこはま
› echo -n 'ヨコハマ' | kakasi -i utf8 -o utf8 -KH
よこはま
You could normalise everything to hiragana first, store these strings additionally in the database, perform collation on them with an existing Unicode collation, and when you found a match, fetch the corresponding original unnormalised string instead.
Upvotes: 1
Reputation: 15593
This is a difficult problem, and I suspect you can't easily solve it just using MySQL.
There's a program called Migemo which deals with this problem to a certain extent in different contexts. It generates a regex to match unconverted input to a document. You can see an online demo of one version with output here. For example, the regex for the input "toukyou" is:
[とトト][うウウ][きキキ][ょョョ][うウウ]|とうきょう|当協会|東京|東教|toukyou|([tţťŧŢŤŦ]|t[¸ˇ-])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))([kķĸǩĶǨ]|k[¸ˇ])([yỳýÿŷỲÝŸŶ]|y[ˋ`ˊ´¨ˆ^])([oòóôõöøǿōŏőǒǫǭÒÓÔÕÖØǾŌŎŐǑǪǬ]|o([ˋ`ˊ´ˆ^˜~¨/ˉ¯˘˝ˇ˛]|/[ˊ´]|[ˊ´]/|˛[ˉ¯]|[ˉ¯]˛))([uùúûüũūŭůűųǔǖǘǚǜÙÚÛÜŨŪŬŮŰŲǓǕǗǙǛ]|u([ˋ`ˊ´ˆ^˜~¨˚°ˉ¯˘˛ˇ]|¨[ˉ¯]|[ˉ¯]¨|¨[ˊ´]|[ˊ´]¨|¨ˇ|ˇ¨|¨[ˋ`]|[ˋ`]¨))
Since Migemo is for matching from partial non-converted input, it expects ASCII input, so you can't use it as-is. However, the basic strategy of Migemo - convert your input to a regex matching all variants - can be used to search your MySQL database.
As a commenter mentioned this is hard if you want to support place names with idiosyncratic romanization; do you need to recognize both "Shinbashi" and "Shimbashi", for example? Even "Tokyo" requires an exception if you're just working from hiragana input. For this I recommend you build your own list, potentially using romaji data from JP Post. It won't deal with all variations but it will get the standard ones.
Hope that helps.
Upvotes: 1