MySQL query returns wrong data when column contains utf-8 characters

Question

I have some MySQL table with column 'title' which type is varchar(255), character set is utf8mb4 and collation is utf8mb4_general_ci.

Lets say I have few records with title and those titles contain (or not) diactrics:

id | title
-----------
1  | zolc
2  | żółć
3  | żołc
4  | zólć

I can correctly insert those diactrics and also they are properly displayed when table is selected. But when I try something like this:

SELECT * 
FROM my_table 
WHERE title LIKE "%zolc%";

I got:

id | title
-----------
1  | zolc
4  | zólć

As you see I asked for version without any diactricts, but also got row with id 4. Selecting żółć returns rows with ids 2 (as expected) and 3. Querying for zołć returns rows 2 and 3, where I would expect that nothing will be returned. There is many combination like this, where some "wrong" rows are returned after query (I tried also with ą and ę and they also act strange).

At first I thought that it is problem with configuration of my technological stack (java web application on top of Spring Boot) but I got exactly the same results when executing queries from MySQL Workbench on local db on Windows machine and by executing queries by ssh to remote db running on Ubuntu machine. There is also no difference if query is done using title LIKE "value" or with WHERE title = "value".

I couldn't find explanation for this - note that this does not simply returns all rows that "match" query parameter but without special characters. I'm struggling to enable search by title but I would like to it be 1:1, so when I use "ż" in my query parameter only rows where "ż" is actually present will be returned.

Thanks in advance for any help.

&#193;lvaro Gonz&#225;lez · Accepted Answer

Your query will use the table/column collation and since that collation considers all those characters equivalent you don't really ask for the values you think. Your choices are to either use proper cultural settings (e.g. utf8mb4_polish_ci) or use none (e.g. utf8mb4_bin). Which option to choose depends on your use case but both are probably better than just using some arbitrary settings: utf8mb4_general_ci is a kind of one size fits all collation designed for speed rather than correctness.

It's also worth nothing that MySQL allows setting collation at different levels:

Table
Column
Specific string

Once more, which one to choose will depend on your specific needs. Here's a little example of the last case (the other ones are straightforward):

SELECT
CASE WHEN 'zolc' COLLATE utf8mb4_general_ci ='zólć' THEN 'equal' ELSE 'different' END AS General,
CASE WHEN 'zolc' COLLATE utf8mb4_unicode_ci ='zólć' THEN 'equal' ELSE 'different' END AS Unicode,
CASE WHEN 'zolc' COLLATE utf8mb4_polish_ci ='zólć' THEN 'equal' ELSE 'different' END AS Polish,
CASE WHEN 'zolc' COLLATE utf8mb4_bin ='zólć' THEN 'equal' ELSE 'different' END AS BinaryCollation,
CASE WHEN BINARY 'zolc'='zólć' THEN 'equal' ELSE 'different' END AS BinaryOperator;

General | Unicode | Polish    | BinaryCollation | BinaryOperator 
------- | ------- | --------- | --------------- | -------------- 
equal   | equal   | different | different       | different

(I've assumed the text is in Polish, sorry if it's not.)

MySQL query returns wrong data when column contains utf-8 characters

Answers (2)

Related Questions