Brandon Buck

Reputation: 7181

How do I avoid losing punctuation when pulling data from a MySQL database with JDBC?

First things first, I'm using:

Java 1.7.0_02
MySQL 5.1.50
ZendServer CE (if that matters)

The JDBC driver I'm using to connect to MySQL from Java is com.mysql.jdbc.Driver. The connection to the Database works fine.

My connection string is:

jdbc:mysql://localhost:3306/table

And in attempts to solve the issue I'm having I've added

?useUnicode=true&characterEncoding=UTF-8 

to the connection string.
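For reference, here is a minimal sketch of how that URL and those two parameters might be assembled and used from Java. The class name `ConnectionUrlSketch`, the helper names, and the user/password arguments are hypothetical; only the base URL and the two parameters come from the text above.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

public class ConnectionUrlSketch {
    // Base URL exactly as given in the question.
    static final String BASE_URL = "jdbc:mysql://localhost:3306/table";

    // Appends the two encoding parameters from the question to a base URL.
    static String withUtf8(String baseUrl) {
        return baseUrl + "?useUnicode=true&characterEncoding=UTF-8";
    }

    // Hypothetical helper: opens a connection with placeholder credentials.
    // Needs the MySQL driver on the classpath and a running server.
    static Connection open(String user, String password) throws SQLException {
        return DriverManager.getConnection(withUtf8(BASE_URL), user, password);
    }

    public static void main(String[] args) {
        // Only prints the assembled URL; no database is contacted here.
        System.out.println(withUtf8(BASE_URL));
    }
}
```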

I'm working with a Wikipedia dump. All the text is in MediaWiki format, and I'm parsing the content with JWPL, which is working beautifully for me. However, somewhere in the process of pulling from the database, parsing, and displaying via HTML, I'm losing characters like '-' and single quotes: I get Earth���s instead of Earth's.

After some testing I've narrowed it down: the characters are not being encoded/decoded properly somewhere between the MySQL query and processing the String in Java. I've come to this conclusion because the text in the database (stored as a MEDIUMBLOB) has the correct characters, while the immediate output of the String in Java after the DB call has broken/missing characters ('?????' instead of Japanese characters, etc.).

I've verified that System.getProperty("file.encoding") is UTF-8, so the JVM should be encoding the String properly when printing (unless there is something wrong with the JVM's UTF-8 > UTF-16 > UTF-8 conversion).

I've also created a UTF-8 table with UTF-8 columns and moved the data to it in the database for testing, which solved nothing. Another attempted fix was to replace:

return result.getString("old_text");

which pulls the text from the result set, with:

return new String(result.getString("old_text").getBytes("utf8"), "utf8");

which gave me the same results as the previous statement.
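A short, self-contained sketch of why that round trip cannot help: encoding a String to UTF-8 and decoding it again with the same charset returns an equal String, so any damage done before `getString` returned is carried through unchanged. The class name and sample text are illustrative assumptions, and the ISO-8859-1 mis-decoding only simulates one way such damage can arise; it is not confirmed to be what happens in this setup.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripSketch {
    // Same shape as the attempted fix: encode to UTF-8, decode as UTF-8.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // Simulates damage: correct UTF-8 bytes decoded with the wrong charset.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String good = "Earth\u2019s";      // contains a right single quotation mark
        String damaged = misdecode(good);  // the quote becomes three junk characters

        System.out.println(roundTrip(good).equals(good));        // true
        System.out.println(roundTrip(damaged).equals(damaged));  // true: still damaged
        System.out.println(damaged.length());                    // 9 instead of 7
    }
}
```

The point is that the repair has to happen where the bytes are first turned into characters, not afterwards.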

Is there a way to avoid this loss of character data when accessing MySQL with JDBC? If not, is there a way I can process the characters and recover the proper character for display purposes? Blocks of two or three random characters in place of standard punctuation kind of break the user experience.

EDIT

A small note: the data in the database is fine. The characters are present, all of them, and visible. Accessing the data through phpMyAdmin returns the data with the properly encoded characters. The issue is arising somewhere between MySQL and Java, perhaps with JDBC. I'm seeking a setting or a workaround (one that works, as the ones I have tried have not worked for me) that will prevent the loss of those characters.

Upvotes: 4

Views: 572

Answers (2)

Brandon Buck

Reputation: 7181

After some research and reading I found a solution that fixed the issues I was having. I can't say exactly why, but the problem seems to have been in converting a MEDIUMBLOB into a String in Java.

This is how I was returning text from the result:

if (result.next())
    return result.getString("old_text");
else
    return null;

I haven't done a lot with JDBC in the past and wasn't aware there was a Blob class, so I altered the code to:

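if (result.next()) {
    // Fetch the column as raw bytes instead of letting the driver build a String.
    Blob blob = result.getBlob("old_text");
    // Blob positions are 1-based; getBytes returns the whole value in one call.
    byte[] bytes = blob.getBytes(1, (int) blob.length());

    // Decode explicitly as UTF-8 rather than relying on any default charset.
    return new String(bytes, "UTF-8");
}
else
    return null;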

And it works beautifully.

Upvotes: 1

Stephen C

Reputation: 718826

I think that the issue has to be in the way that you are encoding and decoding the bytes in the Blob. And it is probably because the default charset is not what you think it is.

I'd recommend that you get and put byte arrays, and that you specify the UTF-8 charset explicitly when converting strings to byte arrays and back again. Don't rely on assumptions about the default charset.
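A minimal sketch of that recommendation, assuming the text lives in a binary column such as the MEDIUMBLOB from the question. The class name `Utf8BytesSketch` and the helpers `encode`, `decode`, `putText`, and `getText` are hypothetical names for illustration; `PreparedStatement.setBytes` and `ResultSet.getBytes` are the standard JDBC calls.

```java
import java.nio.charset.StandardCharsets;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class Utf8BytesSketch {
    // String -> bytes with the charset named explicitly (text must be non-null).
    static byte[] encode(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // bytes -> String with the charset named explicitly; null stays null.
    static String decode(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    // "Put": bind the text as raw UTF-8 bytes rather than as a String.
    static void putText(PreparedStatement stmt, int index, String text) throws SQLException {
        stmt.setBytes(index, encode(text));
    }

    // "Get": read raw bytes (null for SQL NULL) and decode them ourselves.
    static String getText(ResultSet result, String column) throws SQLException {
        return decode(result.getBytes(column));
    }

    public static void main(String[] args) {
        String japanese = "\u65E5\u672C\u8A9E";  // three Japanese characters
        byte[] bytes = encode(japanese);         // three bytes per character in UTF-8
        System.out.println(bytes.length);
        System.out.println(decode(bytes).equals(japanese));
    }
}
```

With this shape, the only place a charset is chosen is in `encode` and `decode`, so the JVM default never enters the picture.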

FWIW, the correct way to find out the JVM's default charset is to look at the object returned by Charset.defaultCharset().
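A small sketch of that check. The class name `DefaultCharsetSketch` and the `describe` helper are illustrative; the `file.encoding` property is printed alongside only for comparison with what the question inspected.

```java
import java.nio.charset.Charset;

public class DefaultCharsetSketch {
    // Reports the charset the JVM actually uses by default,
    // next to the file.encoding system property.
    static String describe() {
        return "default charset = " + Charset.defaultCharset().name()
                + ", file.encoding = " + System.getProperty("file.encoding");
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```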

Upvotes: 0
