Eliran

Reputation: 207

Converting a non-UTF-8 database to UTF-8

For a long time I have been using a database connection with the wrong encoding. As a result, the Hebrew characters in the database display as unrecognizable characters, as in the example below:


I want to re-import or convert the database so that the wrongly encoded characters are stored with the right encoding, and the Hebrew text is displayed as Hebrew rather than as garbage like *"× ×תה מסכי×,×× ×©×™× ×ž×¦×™×¢×™× ×œ×™ כמה ×”× "*

For the record, when I display this data with PHP, it shows as Hebrew. When I view it in the phpMyAdmin panel, it shows as gibberish (the unrecognizable characters above).

Is there any way to fix this even though some data has already been inserted into the database?

Upvotes: 0

Views: 837

Answers (2)

Rick James

Reputation: 142538

That feels like "double-encoded" Hebrew strings.

This partially recovers the text:

UNHEX(HEX(CONVERT('× ×תה מסכי×,××' USING latin1))) 
-->  '� �תה מסכי�,�� 

I do not know what leads to the � symbols.

Please do SELECT col, HEX(col) FROM ... WHERE ...; for some cell. I would expect שלום to give hex D7A9D79CD795D79D if it were correctly stored. For "double encoding", I would expect C397C2A9C397C593C397E280A2C397C29D.
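To see where those two hex strings come from, here is a minimal Python sketch of the byte-level effect. It is not something MySQL runs; the helper `mysql_latin1_decode` is a hypothetical name, and it assumes MySQL's `latin1` behaves like Windows cp1252 with the five undefined bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) passed through unchanged.

```python
def mysql_latin1_decode(raw: bytes) -> str:
    """Interpret bytes the way MySQL's latin1 is assumed to:
    cp1252, with the five undefined bytes mapped to the same code point."""
    chars = []
    for b in raw:
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # 0x81, 0x8D, 0x8F, 0x90, 0x9D are undefined in cp1252
            chars.append(chr(b))
    return "".join(chars)


# Correct storage: the UTF-8 bytes of the Hebrew word
correct = "שלום".encode("utf-8")

# Double encoding: those bytes are misread as latin1, then encoded as UTF-8 again
double = mysql_latin1_decode(correct).encode("utf-8")

correct_hex = correct.hex().upper()
double_hex = double.hex().upper()
print(correct_hex)  # D7A9D79CD795D79D
print(double_hex)   # C397C2A9C397C593C397E280A2C397C29D
```

If the HEX(col) output starts with repeated C397 pairs like the second line, the column most likely holds double-encoded text.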

Please provide the output from that SELECT, then I will work on how to recover the data.

Edit

Here's what I think happened.

  • The client had characters encoded as utf8; and
  • SET NAMES latin1 lied by claiming that the client had latin1 encoding; and
  • The column in the table was declared CHARACTER SET utf8.

Yod did not jump out as a letter, so it took a while to see it. CONVERT(BINARY(CONVERT('×™×™123' USING latin1)) USING utf8) -->יי123

So, I am thinking that that expression will clean up the text. But be cautious; try it on a few rows before 'fixing' the entire table.

UPDATE table SET col = CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8) WHERE ...;

If that does not work, here are 4 fixes for double-encoding that may or may not be equivalent. (Note: BINARY(xx) is probably the same as CONVERT(xx USING binary).)
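Before running any UPDATE, it may help to check sample values outside the database. The following Python sketch mirrors the idea of CONVERT(BINARY(CONVERT(col USING latin1)) USING utf8): turn the mangled text back into latin1 bytes, then read those bytes as UTF-8. The function names are hypothetical, and the cp1252 handling is an assumption about how MySQL's `latin1` maps bytes.

```python
from typing import Optional

# Bytes that cp1252 leaves undefined; MySQL's latin1 is assumed to map
# them to the code point with the same value.
_PASS_THROUGH = (0x81, 0x8D, 0x8F, 0x90, 0x9D)


def mysql_latin1_encode(text: str) -> bytes:
    """Encode text to bytes the way MySQL's latin1 is assumed to."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            if ord(ch) in _PASS_THROUGH:
                out.append(ord(ch))
            else:
                raise
    return bytes(out)


def undo_double_encoding(text: str) -> Optional[str]:
    """Return the repaired text, or None if the value does not look
    double-encoded (for example, it is already proper Hebrew)."""
    try:
        return mysql_latin1_encode(text).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


print(undo_double_encoding("×™×™123"))
print(undo_double_encoding("שלום"))  # already correct, so None
```

Returning None for values that cannot be reversed is a deliberate choice here: rows that are already stored correctly should be left alone rather than converted a second time.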

Upvotes: 1

BK435

Reputation: 3176

I am not sure that you can do anything about the data that has already been stored in the database. However, you can import Hebrew data properly by making sure you have the correct character set and collation.

  1. The database collation has to be utf8_general_ci.
  2. The collation of the table holding the Hebrew text has to be utf8_general_ci.

For example:

CREATE DATABASE col CHARACTER SET utf8 COLLATE utf8_general_ci;

CREATE TABLE `col`.`hebrew` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `heb` VARCHAR(45) NOT NULL,
  PRIMARY KEY (`id`)
  ) CHARACTER SET utf8
  COLLATE utf8_general_ci;

INSERT INTO `col`.`hebrew` (heb) VALUES ('שלום');


Upvotes: 0
