Hillcow
Hillcow

Reputation: 969

Invalid utf8 character string. Converting "latin1_german1_ci" column entirely into UTF8

I have a table with a column that has data that seems not to be UTF8. I want to convert that column into UTF8.

I found this wonderful tutorial: https://coderwall.com/p/gjyuwg/mysql-convert-encoding-to-utf8-without-garbled-data

However, none of these solutions really work.

When I do

UPDATE vbpmtext 
SET message = @txt 
WHERE char_length(message) =  LENGTH(@txt := CONVERT(BINARY CONVERT(message USING latin1) USING utf8));

I get a ton of errors like this:

Invalid utf8 character string: 'FC6265'

with different "strings" (FC6265 is just an example).

Is there any way I can rescue this data?

The column we are speaking of is naturally formatted with a collation of latin1_german1_ci.

Upvotes: 0

Views: 1097

Answers (2)

Hillcow
Hillcow

Reputation: 969

Okay, so this was my fault. It wasn't that the data itself was corrupted, but I was using php substr method on the data which cut the data at unfortunate places. A solution was found here: php substr() function with utf-8 leaves � marks at the end

$var = "Бензин Офиси А.С. также производит все типы жира и смазок и их побочных        продуктов в его смесительных установках нефти машинного масла в Деринце, Измите, Алиага и Измире. У Компании есть 3 885 станций технического обслуживания, включая сжиженный газ (ЛПГ) станции под фирменным знаком Петрогаз, приблизительно 5 000 дилеров, двух смазочных смесительных установок, 12 терминалов, и 26 единиц поставки аэропорта.";

$foo = mb_substr($var,0,142, "utf-8");

Upvotes: 0

Rick James
Rick James

Reputation: 142218

Interpreted as latin1, FC6265 is übe. (Ditto for cp1250, cp1256, cp1257, dec8, latin2, latin5, latin7.)

Was @txt the 3-character string übe? Or the 6-character string FC6265?

mysql> SET @in := UNHEX('FC6265');
mysql> SELECT HEX(@in);
+----------+
| HEX(@in) |
+----------+
| FC6265   |
+----------+
mysql> SELECT HEX( CONVERT(@in USING latin1) );
+----------------------------------+
| HEX( CONVERT(@in USING latin1) ) |
+----------------------------------+
| FC6265                           |
+----------------------------------+
mysql> SELECT HEX( BINARY(CONVERT(@in USING latin1)) );
+------------------------------------------+
| HEX( BINARY(CONVERT(@in USING latin1)) ) |
+------------------------------------------+
| FC6265                                   |
+------------------------------------------+
mysql> SELECT HEX( CONVERT(BINARY CONVERT(@in USING latin1) USING utf8) );
+-------------------------------------------------------------+
| HEX( CONVERT(BINARY CONVERT(@in USING latin1) USING utf8) ) |
+-------------------------------------------------------------+
|                                                             |
+-------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> SHOW WARNINGS;
+---------+------+-----------------------------------------+
| Level   | Code | Message                                 |
+---------+------+-----------------------------------------+
| Warning | 1300 | Invalid utf8 character string: 'FC6265' |
+---------+------+-----------------------------------------+

Using BINARY() takes away any assumption of what the string is currently encoded as. So it takes the simplest approach and assumes the string is already utf8.

This is probably the shortest way:

mysql> SELECT CONVERT(CONVERT(@in USING latin1) USING utf8);
+-----------------------------------------------+
| CONVERT(CONVERT(@in USING latin1) USING utf8) |
+-----------------------------------------------+
| übe                                           |
+-----------------------------------------------+

But... what is the charset of the column? If it is latin1, you are making a worse mess. Do not do any changes without first testing further. Here are several cases and the fix for each. Do not rush into one solution before figuring out if you have that case; you are likely to make things worse. See also Trouble with UTF-8 characters; what I see is not what I stored

Example

mysql> CREATE TABLE ube ( c VARCHAR(8) CHARSET latin1 COLLATE latin1_german1_ci );

mysql> INSERT INTO ube (c) VALUES (UNHEX('FC6265'));

mysql> SELECT HEX(c) FROM ube;
+--------+
| HEX(c) |
+--------+
| FC6265 |  -- Note the latin1 encoding
+--------+

mysql> ALTER TABLE ube CONVERT TO CHARACTER SET utf8mb4;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0   -- Note: no errors

mysql> SELECT HEX(c) FROM ube;
+----------+
| HEX(c)   |
+----------+
| C3BC6265 |   -- Now utf8mb4 encoding
+----------+

Upvotes: 2

Related Questions