Reputation: 66
I'm trying to insert data from a UTF-8 encoded text file into a UTF-8 encoded MySQL database.
A line in the text file I am trying to insert looks like this:
<http://dbpedia.org/resource/Canada_(New_France)> <http://purl.org/dc/terms/subject> <http://dbpedia.org/resource/Category:History_of_Canada_(1534–1763)> .
To load the file into the database I use this command:
LOAD DATA LOCAL INFILE 'path/to/file'
INTO TABLE table_name
CHARACTER SET utf8
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(s, @p, o)
COMMIT;
The create table command looks like this:
CREATE TABLE `article_categories3` (`s` varchar(511) NOT NULL, `o` varchar(511) NOT NULL) ENGINE=InnoDB DEFAULT CHARSET=utf8
Unfortunately the resulting entry in the database looks like this:
| <http://dbpedia.org/resource/Canada_(New_France)> | <http://dbpedia.org/resource/Category:History_of_Canada_(1534û1763)> |
The en dash '–' is being replaced by 'û'. The same happens with other special characters: for example, 'ï' is replaced by '´'.
mysql shows these charsets:
Server characterset: utf8
Db characterset: utf8
Client characterset: latin1
Conn. characterset: latin1
character_set_results: latin1
Changing the client and connection character sets and character_set_results to utf8 with the command \C utf8 results in:
<http://dbpedia.org/resource/Category:History_of_Canada_(1534ÔÇô1763)>
<http://dbpedia.org/resource/Category:Oïl_languages>
Changing them to cp850 results in:
<http://dbpedia.org/resource/Category:History_of_Canada_(1534?1763)>
<http://dbpedia.org/resource/Category:Oïl_languages>
So with this setting 'ï' is shown correctly, but the en dash becomes '?'.
I also tried writing a small Java program that inserts the lines from the file into the database, but it produces the same result.
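Since the Java approach showed the same symptom, it is worth ruling out the platform default charset there: on Windows, reading a file without an explicit charset decodes it with the ANSI code page, not UTF-8. A minimal, self-contained sketch (the sample line and class name are made up for illustration):

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class Utf8ReadCheck {
    public static void main(String[] args) throws IOException {
        // Write a sample line containing an en dash (U+2013) as UTF-8 bytes.
        Path tmp = Files.createTempFile("utf8check", ".ttl");
        Files.write(tmp, "Category:History_of_Canada_(1534\u20131763)"
                .getBytes(StandardCharsets.UTF_8));

        byte[] raw = Files.readAllBytes(tmp);

        // Decoding with an explicit UTF-8 charset keeps the en dash intact.
        System.out.println(new String(raw, StandardCharsets.UTF_8).contains("\u2013"));          // true

        // Decoding the same bytes with a single-byte Windows charset loses it.
        System.out.println(new String(raw, Charset.forName("windows-1252")).contains("\u2013")); // false

        Files.delete(tmp);
    }
}
```

If the program builds its reader without a charset (e.g. new FileReader(...)), switching to new InputStreamReader(in, StandardCharsets.UTF_8) makes the decoding explicit.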
But if I manually type an INSERT command into the mysql console, the row is inserted correctly:
INSERT INTO table_name VALUES ('<http://dbpedia.org/resource/Canada_(New_France)>', '<http://dbpedia.org/resource/Category:History_of_Canada_(1534–1763)>');
The hex codes of the entries are:
3C687474703A2F2F646270656469612E6F72672F7265736F757263652F43617465676F72793A486973746F72795F6F665F43616E6164615F2831353334E2809331373633293E
(the en dash here is E28093)
and
3C687474703A2F2F646270656469612E6F72672F7265736F757263652F43617465676F72793A4FC3AF6C5F6C616E6775616765733E
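Those hex dumps can be sanity-checked outside MySQL: E28093 is exactly the UTF-8 encoding of the en dash (U+2013) and C3AF that of 'ï' (U+00EF), so the column appears to contain correct UTF-8 bytes. A small verification sketch (class and helper names are made up):

```java
import java.nio.charset.StandardCharsets;

public class HexCheck {
    // Decode a hex string such as "E28093" into raw bytes.
    static byte[] hexToBytes(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        // E28093 decodes to the en dash (U+2013) under UTF-8 ...
        System.out.println(new String(hexToBytes("E28093"), StandardCharsets.UTF_8)); // –
        // ... and C3AF decodes to 'ï' (U+00EF).
        System.out.println(new String(hexToBytes("C3AF"), StandardCharsets.UTF_8));   // ï
    }
}
```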
I checked the encoding of the text file like this:
file -i test.ttl
test.ttl: text/plain; charset=utf-8
Do you have any idea what else I could try in order to load the file into the database with the correct encoding?
Upvotes: 2
Views: 3423
Reputation: 142528
You seem to have the character set cp850 somewhere in the flow of data.
Please provide SHOW CREATE TABLE. Also verify that the incoming data is actually encoded as UTF-8.
Get the HEX() of that column -- 'û' is hex C3BB; en dash and em dash are E28093 and E28094 in utf8. Something else is going on.
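In fact, the ÔÇô seen with \C utf8 is itself telling: Ô, Ç and ô are the cp850 renderings of the three UTF-8 bytes E2 80 93 of the en dash. That would mean the stored data is fine and only the console display (a cp850 terminal showing raw UTF-8 bytes) is wrong. A sketch that reproduces the effect (class name is made up):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // UTF-8 bytes of the en dash (U+2013): E2 80 93.
        byte[] utf8 = "\u2013".getBytes(StandardCharsets.UTF_8);

        // Interpreting those bytes one-by-one as cp850, the way a cp850
        // console does, yields exactly the characters from the question.
        String shown = new String(utf8, Charset.forName("Cp850"));
        System.out.println(shown); // ÔÇô
    }
}
```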
More
All three of these need to be set the same: character_set_client, character_set_connection, and character_set_results.
Upvotes: 1