Getting MySQL to properly distinguish Japanese characters in SELECT calls

Question

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.

Unlike other questions on this so far, I don't know that it's an encoding issue, per se. I've set the coallation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.

The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:

ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)

It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.

What I noticed after I put everything in is that wherever two such similar readings occurred, the first one encountered would be logged and would then show up as a false positive for the other if it showed up. For example:

キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.

I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view regarding differentiating between characters (e.g. if the characters have two different UTF-8 code points, treat them as different characters). Is there any way to get this behavior?

Joni · Accepted Answer

You can use utf8_bin to get a collation that compares characters by their Unicode code points.

The utf8_general_ci collation also distinguishes キョウ and ギョウ.

Getting MySQL to properly distinguish Japanese characters in SELECT calls

Answers (2)

Related Questions