Jonathan Laliberte

Reputation: 2725

Why are question marks replacing certain characters in a MySQL database with a UTF-8 character set?

I'm using Jsoup to scrape a webpage. It takes the text and enters it directly into the database.

The text on the target webpage looks perfectly fine, but after entering it into the database I get question marks replacing certain characters.

For example, take the right single quotation marks (U+2019) in the following sentence:

I can’t imagine uh, a domain of human endeavor that isn’t impacted by the imagination.

They show up like this in the database and on the webpage I'm outputting it on:

I can?t imagine uh, a domain of human endeavor that isn?t impacted by the imagination.

Initially I thought this was just a problem with the charset/collation of the database, but after trying out different types, the problem persists.

The MySQL database I'm currently working in uses utf8:

mysql> SHOW VARIABLES LIKE 'character\_set\_%';
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+

And the meta is set:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

I've tried setting it explicitly in Java, in the JDBC URL:

  url = "jdbc:mysql://localhost:3306/somedb?useUnicode=true&characterEncoding=utf-8";

I've tried SQL queries like:

SET NAMES 'utf8'
SET CHARACTER SET utf8

I've also tried creating a new database, and nothing seems to work.

Any ideas why this might be happening?

Upvotes: 1

Views: 1331

Answers (2)

Rick James

Reputation: 142540

There are several steps to make a page work correctly.

See the "question mark" section of "Trouble with UTF-8 characters; what I see is not what I stored".

Upvotes: 0

Eritrean

Reputation: 16508

Jsoup automatically detects the charset of the webpage being crawled. However, many websites omit the charset parameter from the Content-Type response header.

If you crawl such a page, Jsoup cannot take the encoding from the header and parses the page with a fallback character set instead (the platform's default, or UTF-8 in recent Jsoup versions). That fallback may differ from the encoding the page actually uses, so characters can be lost or parsed and printed incorrectly.

To avoid this, read the URL as an InputStream and pass the page's character set explicitly to Jsoup's parse method, as shown below:

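import java.io.InputStream;
import java.net.URL;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

String page = "http://www.somepage.com";

// Open an input stream from the URL; try-with-resources closes it when done
try (InputStream in = new URL(page).openStream()) {
    // Parse the stream, naming the charset explicitly instead of relying on detection
    Document doc = Jsoup.parse(in, "ISO-8859-1", page);

    // ..do your processing
}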
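
To see one way a character can turn into a literal question mark, here is a minimal sketch that uses only the standard library (no Jsoup or MySQL involved). It assumes the loss happens when text is encoded into a charset that has no mapping for U+2019, which is only one possible cause of the symptom in the question. The class name QuestionMarkDemo and the helper roundTrip are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class QuestionMarkDemo {

    // Encode the text to bytes with the given charset, then decode it
    // again with the same charset. Characters the charset cannot
    // represent are replaced during encoding (for ISO-8859-1, with '?').
    public static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        // \u2019 is the right single quotation mark from the question
        String original = "I can\u2019t imagine";

        // UTF-8 can represent U+2019, so the text survives unchanged
        System.out.println(roundTrip(original, StandardCharsets.UTF_8).equals(original)); // prints true

        // ISO-8859-1 has no U+2019, so the encoder substitutes '?'
        System.out.println(roundTrip(original, StandardCharsets.ISO_8859_1)); // prints I can?t imagine
    }
}
```

If any step between the scraped page and the database column encodes the text with a charset that lacks the character, the original character is gone for good, and no later SET NAMES or meta tag can bring it back.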

Upvotes: 1
