Kaji

Reputation: 2330

Getting MySQL to properly distinguish Japanese characters in SELECT calls

I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.

Unlike other questions on this so far, I don't know that it's an encoding issue, per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.

The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:

It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
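A minimal sketch of that lookup-or-insert step, for illustration only. The table name `readings`, its columns, and the helper `get_or_create_reading` are hypothetical, and it uses Python's built-in `sqlite3` module rather than MySQL so that it runs on its own. SQLite compares text byte-wise by default, so this sketch does not reproduce the false positives described here; it only shows the flow of the check.

```python
import sqlite3


def get_or_create_reading(conn, reading):
    """Return the id of `reading`, inserting a new row if it is not logged yet."""
    row = conn.execute(
        "SELECT id FROM readings WHERE reading = ?", (reading,)
    ).fetchone()
    if row is not None:
        # Already logged: reuse the existing id for pairing with a headword.
        return row[0]
    cur = conn.execute("INSERT INTO readings (reading) VALUES (?)", (reading,))
    return cur.lastrowid


# Hypothetical schema, held in memory so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, reading TEXT UNIQUE)")

first = get_or_create_reading(conn, "\u30ad\u30e7\u30a6")  # キョウ
again = get_or_create_reading(conn, "\u30ad\u30e7\u30a6")  # same reading, same id
other = get_or_create_reading(conn, "\u30ae\u30e7\u30a6")  # ギョウ
```

With a byte-wise comparison, `first` and `again` share an id while `other` gets a new one. Under a collation that treats the two readings as equal, the SELECT for the second reading would instead return the first reading's id.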

What I noticed after loading everything is that wherever two such similar readings occurred, the first one encountered was logged and then came back as a false positive when the other was looked up. For example:

I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view when differentiating between characters (e.g. if two characters have different Unicode code points, treat them as different characters). Is there any way to get this behavior?

Upvotes: 0

Views: 262

Answers (2)

Joni

Reputation: 111219

You can use utf8_bin to get a collation that compares characters by their Unicode code points.

The utf8_general_ci collation also distinguishes キョウ and ギョウ.
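To see why a code-point comparison separates the two readings, here is a small Python sketch using only the standard library. The helpers `code_points` and `strip_marks` are made up for this illustration. `strip_marks` is only a rough stand-in for an accent-insensitive comparison, on the assumption that utf8_unicode_ci ignores the voicing mark in a similar way; it is not MySQL's actual algorithm.

```python
import unicodedata


def code_points(text):
    """List the Unicode code points of `text` in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_marks(text):
    """Decompose `text` (NFD) and drop combining marks such as the voicing mark.

    Rough stand-in for an accent-insensitive comparison, not MySQL's algorithm.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


kyou = "\u30ad\u30e7\u30a6"  # キョウ
gyou = "\u30ae\u30e7\u30a6"  # ギョウ

# Byte-for-byte comparison, which is what a binary collation amounts to.
binary_equal = kyou.encode("utf-8") == gyou.encode("utf-8")

# Comparison after removing combining marks: ギ decomposes to キ + U+3099.
loose_equal = strip_marks(kyou) == strip_marks(gyou)

print(code_points(kyou), code_points(gyou))
print(binary_equal, loose_equal)
```

The first characters are U+30AD and U+30AE, so the byte-wise comparison reports them as different, while the mark-stripping comparison reports them as equal.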

Upvotes: 2

Yamen Nassif

Reputation: 2476

When saving to the database, store the text as binary, and convert it back to Japanese when reading it out. I ran into the same problem with Arabic.

Upvotes: 1
