AdrianK
AdrianK

Reputation: 66

MySQL Load data infile UTF8 encoding

I'm trying to insert data from a utf8 encoded text file into a utf8 encoded mysql database.

A line in the textfile I am trying to insert looks like this:

<http://dbpedia.org/resource/Canada_(New_France)> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:History_of_Canada_(1534–1763)> .

To load the file into the database I use this command:

LOAD DATA LOCAL INFILE 'path/to/file'
INTO TABLE table_name
CHARACTER SET utf8
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(s, @p, o)
COMMIT;

The create table command looks like this:

 CREATE TABLE `article_categories3` (`s` varchar(511) NOT NULL, `o` varchar(511) NOT NULL) ENGINE=InnoDB DEFAULT CHARSET=utf8

Unfortunately the resulting entry in the database looks like this:

| <http://dbpedia.org/resource/Canada_(New_France)>       | <http://dbpedia.org/resource/Category:History_of_Canada_(1534û1763)> |

'-' is getting replaced by 'û'. The same happens for other special characters, like 'ï' gets replaced by '´'.

mysql shows these charsets:

Server characterset:    utf8
Db     characterset:    utf8
Client characterset:    latin1
Conn.  characterset:    latin1
character_set_results: latin1

changing client and connection charset and character_set_results to utf8 with the command \C utf8 results in:

<http://dbpedia.org/resource/Category:History_of_Canada_(1534ÔÇô1763)>
<http://dbpedia.org/resource/Category:Oïl_languages>

changing them to cp850 results in:

<http://dbpedia.org/resource/Category:History_of_Canada_(1534?1763)>
<http://dbpedia.org/resource/Category:Oïl_languages>

So with this setting 'ï' is shown correct.

I also tried writing a little java program, which inserts the lines from the file into the database, but I encounter the same result with this approach.

But if I manually type an insert command into the mysql-console, it will get inserted correct:

 INSERT INTO table_name VALUES ('<http://dbpedia.org/resource/Canada_(New_France)>', '<http://dbpedia.org/resource/Category:History_of_Canada_(1534–1763)>');

The Hex codes of the entries are:

3C687474703A2F2F646270656469612E6F72672F7265736F757263652F43617465676F72793A486973746F72795F6F665F43616E6164615F2831353334E2809331373633293E

(endash here E28093)

and

3C687474703A2F2F646270656469612E6F72672F7265736F757263652F43617465676F72793A4FC3AF6C5F6C616E6775616765733E

I checked the encoding of the text file like this:

file -i test.ttl
test.ttl: text/plain; charset=utf-8

Do you have an idea, what else i could try to load the file into the database with the correct encoding?

Upvotes: 2

Views: 3423

Answers (1)

Rick James
Rick James

Reputation: 142528

You seem to have the character set cp850 somewhere in the flow of data.

Please provide

  • SHOW CREATE TABLE
  • The connection parameters when connecting to MySQL.

Also verify that the incoming data is actually encoded UTF-8.

Get the HEX() of that column -- û is hex C3BB; endash and emdash are E28093 and E28094 in utf8. Something else is going on.

More

All three of these need to be set the same: character_set_client, character_set_connection, and character_set_results.

Upvotes: 1

Related Questions