Reputation: 528
I've looked all over the web only to find ways to make searches insensitive to diacritic characters, but I need the opposite.
In my situation I need to be able to compare specifically symbols with superscript and subscript dots (i.e. ȧ & ạ) and some other more common acents (á, ã, etc), but these letters might be anything (ṡ, ṛ, ṫ, ḍ, ṅ, etc). The desired result would work like this: if I search for "a" I receive "a" only, and if I search for "ȧ", I receive "ȧ" only as a result and not "a" along with it (without the dot).
I've read that I need to use utf8_bin and have tried changing both my field collations, table collations and database collations to that with no success. Here's the code:
// "sound" is being passed in by an AJAX call
$sound = $_POST['sound'];
$query = "SELECT * FROM sounds WHERE 'sound' = '$sound'";
$result = mysql_query($query);
// This is then sent back to my page.
I've also looked into COLLATE with little success. I'm probably misunderstanding its prober usage:
// Attempting to covert the searched string into the utf8_bin format to match my db collations
$query = "SELECT * FROM sounds WHERE 'sound' = '$sound' COLLATE utf8_bin";
When I use utf8_general_ci or utf8_unicode_ci I get the excepted result of "a" or "ȧ" returning both "ȧ" and "a". However, If I use utf8_bin I get nothing when searching for either of these. I believe this is because in my database when using utf8_bin this - "ṅ(PH)" (one of my entries) - gets converted to this - "e1b98528504829". So Is there a way of converting my searches to that same format before querying them? Or just an all around better way of making this work?
Thank you!
Upvotes: 3
Views: 468
Reputation: 528
Okay, with a little help from a friend I got it working. Turns out it works just fine with utf8_general_ci collations as well.
My first issue was with how I had entered my data into my database. I had used phpMyAdmin to do this, which for some reason didn't encode the data properly and all my bins were turning out wrong. This was fixed by just writing my own sql to enter the values.
Second, I ended up using the PHP function iconv() to encode the data coming from the web page. These two solutions put together got matching values and the entire script works great.
Thanks all for the help and suggestions... really appreciated, and believe me, did not go to waste. I spent a long time fiddling with all of them.
Cheers!
Upvotes: 0
Reputation: 234857
My guess is that your data is not normalized. In order to use the utf8_bin collating sequence, you need to be working with normalized data. Both the data in the data base and the data in the query need to be normalized.
The byte sequence e1 b9 85
is the UTF-8 encoding of LATIN SMALL LETTER N WITH DOT ABOVE (U+1E45), but this can be decomposed into LATIN SMALL LETTER N (U+006E) + COMBINING DOT ABOVE (U+0307). The UTF-8 encoding of the decomposition would be 6e cc 87
. The utf8_general_ci and utf8_unicode_ci collation sequences take care of this automatically, but utf8_bin does not.
On a separate note--you should not be constructing the query by directly interpolating $sound
. This opens up a huge security hole in your system by making it vulnerable to SQL injection attacks. Instead, use a prepared statement and parameter binding. (The php docs have an example of how to do this.)
Upvotes: 2