Reputation: 969
I have a table with a column that has data that seems not to be UTF8. I want to convert that column into UTF8.
I found this wonderful tutorial: https://coderwall.com/p/gjyuwg/mysql-convert-encoding-to-utf8-without-garbled-data
However, none of these solutions really work.
When I do
UPDATE vbpmtext
SET message = @txt
WHERE char_length(message) = LENGTH(@txt := CONVERT(BINARY CONVERT(message USING latin1) USING utf8));
I get a ton of errors like this:
Invalid utf8 character string: 'FC6265'
with different "strings" (FC6265 is just an example).
Is there any way I can rescue this data?
The column we are speaking of is naturally formatted with a collation of latin1_german1_ci
.
Upvotes: 0
Views: 1097
Reputation: 969
Okay, so this was my fault. It wasn't that the data itself was corrupted, but I was using php substr method on the data which cut the data at unfortunate places. A solution was found here: php substr() function with utf-8 leaves � marks at the end
$var = "Бензин Офиси А.С. также производит все типы жира и смазок и их побочных продуктов в его смесительных установках нефти машинного масла в Деринце, Измите, Алиага и Измире. У Компании есть 3 885 станций технического обслуживания, включая сжиженный газ (ЛПГ) станции под фирменным знаком Петрогаз, приблизительно 5 000 дилеров, двух смазочных смесительных установок, 12 терминалов, и 26 единиц поставки аэропорта.";
$foo = mb_substr($var,0,142, "utf-8");
Upvotes: 0
Reputation: 142218
Interpreted as latin1, FC6265
is übe
. (Ditto for cp1250, cp1256, cp1257, dec8, latin2, latin5, latin7.)
Was @txt
the 3-character string übe
? Or the 6-character string FC6265
?
mysql> SET @in := UNHEX('FC6265');
mysql> SELECT HEX(@in);
+----------+
| HEX(@in) |
+----------+
| FC6265 |
+----------+
mysql> SELECT HEX( CONVERT(@in USING latin1) );
+----------------------------------+
| HEX( CONVERT(@in USING latin1) ) |
+----------------------------------+
| FC6265 |
+----------------------------------+
mysql> SELECT HEX( BINARY(CONVERT(@in USING latin1)) );
+------------------------------------------+
| HEX( BINARY(CONVERT(@in USING latin1)) ) |
+------------------------------------------+
| FC6265 |
+------------------------------------------+
mysql> SELECT HEX( CONVERT(BINARY CONVERT(@in USING latin1) USING utf8) );
+-------------------------------------------------------------+
| HEX( CONVERT(BINARY CONVERT(@in USING latin1) USING utf8) ) |
+-------------------------------------------------------------+
| |
+-------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)
mysql> SHOW WARNINGS;
+---------+------+-----------------------------------------+
| Level | Code | Message |
+---------+------+-----------------------------------------+
| Warning | 1300 | Invalid utf8 character string: 'FC6265' |
+---------+------+-----------------------------------------+
Using BINARY()
takes away any assumption of what the string is currently encoded as. So it takes the simplest approach and assumes the string is already utf8
.
This is probably the shortest way:
mysql> SELECT CONVERT(CONVERT(@in USING latin1) USING utf8);
+-----------------------------------------------+
| CONVERT(CONVERT(@in USING latin1) USING utf8) |
+-----------------------------------------------+
| übe |
+-----------------------------------------------+
But... what is the charset of the column? If it is latin1, you are making a worse mess. Do not do any changes without first testing further. Here are several cases and the fix for each. Do not rush into one solution before figuring out if you have that case; you are likely to make things worse. See also Trouble with UTF-8 characters; what I see is not what I stored
Example
mysql> CREATE TABLE ube ( c VARCHAR(8) CHARSET latin1 COLLATE latin1_german1_ci );
mysql> INSERT INTO ube (c) VALUES (UNHEX('FC6265'));
mysql> SELECT HEX(c) FROM ube;
+--------+
| HEX(c) |
+--------+
| FC6265 | -- Note the latin1 encoding
+--------+
mysql> ALTER TABLE ube CONVERT TO CHARACTER SET utf8mb4;
Query OK, 1 row affected (0.04 sec)
Records: 1 Duplicates: 0 Warnings: 0 -- Note: no errors
mysql> SELECT HEX(c) FROM ube;
+----------+
| HEX(c) |
+----------+
| C3BC6265 | -- Now utf8mb4 encoding
+----------+
Upvotes: 2