Daniel
Daniel

Reputation: 433

MySQL returns incorrect UTF8 extended characters in some cases only

Note: In the following question you may see ? or blocks instead of characters, this is because you don't have the appropriate font. Please ignore this.

Background

I have a table with data structured as follows:

CREATE TABLE `decomposition_dup` (
 `id` int(11) NOT NULL AUTO_INCREMENT,
 `parent` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `structure` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 `child` varchar(50) COLLATE utf8mb4_unicode_ci NOT NULL,
 PRIMARY KEY (`id`),
 KEY `parent` (`parent`),
 KEY `child` (`child`),
 KEY `parent_2` (`parent`,`child`)
) ENGINE=InnoDB AUTO_INCREMENT=211929 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci

And some example data:

INSERT INTO `decomposition_dup` (`id`, `parent`, `structure`, `child`) VALUES
(154647, '锦', 'a', '钅'),
(154648, '锦', 'a', '帛'),
(185775, '钅', 'd', '二'),
(185774, '钅', 'd', '㇟'),
(21195, '钅', 'd', '𠂉'),
(21178, '⻐', 'd', '乇'),
(21177, '⻐', 'd', '𠂉');

And the charsets are all set correctly:

enter image description here

Problem

It is very important to note that:

As you can see, they are different characters. However, in some cases they are treated as the same character.

Case 1

When I run the following query, it only returns the correct child (i.e. doesn't return the similar-looking but different character child):

SELECT *
FROM decomposition_dup
WHERE parent = '锦'

enter image description here

This is correct behaviour.

Case 2

However, when I run the following query using 钅 (http://unicode.scarfboy.com/?s=%E9%92%85) it returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '钅'

enter image description here

This should only return 185775, 185774 & 21195.

Case 3

And when I run the following query using ⻐ (http://unicode.scarfboy.com/?s=%E2%BB%90) it also returns both the similar characters:

SELECT *
FROM decomposition_dup
WHERE parent = '⻐'

enter image description here

This should only return 21178 and 21177.

Case 4

If I replace = with LIKE for the broken queries (i.e. Case 2 and Case 3), they return correctly.

For example, the following query is the same as Case 3 but usingLIKE:

SELECT *
FROM decomposition_dup
WHERE parent LIKE '⻐'

enter image description here

This returns the correct characters but slows down the query.

Question

Is this a bug in MySQL or is there something that I am overlooking when querying for UTF8 extended characters?

Upvotes: 1

Views: 595

Answers (2)

Rick James
Rick James

Reputation: 142298

If you want them to be the same, set the COLLATION of the columns to utf8mb4_unicode_ci or utf8mb4_unicode_520_ci.
If you want them to be different, use utf8mb4_general_ci, instead:

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_general_ci AS general;
+---------+
| general |
+---------+
|       0 |
+---------+

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_ci AS unicode;
+---------+
| unicode |
+---------+
|       1 |
+---------+

mysql> SELECT CONVERT(UNHEX('e99285') USING utf8mb4) =
    ->        CONVERT(UNHEX('e2bb90') USING utf8mb4) COLLATE utf8mb4_unicode_520_ci AS unicode_520;
+-------------+
| unicode_520 |
+-------------+
|           1 |
+-------------+

Upvotes: 2

Michael
Michael

Reputation: 33

From what I can make out the problem lies within the SQL side of things upon research you'll see that this error code means that

MySQL's utf8 permits only the Unicode characters that can be represented with 3 bytes in UTF-8.

so It might be the characters you are using within the cases of SQL

Upvotes: -1

Related Questions