James

Reputation: 15475

Unicode issue with an HTML title: question mark? &#65533;

I'm trying to parse the title from the following webpage: http://kid37.blogger.de/stories/1670573/

When I use the StringEscapeUtils.escapeHtml method from Apache Commons Lang (org.apache.commons.lang) on the title element, I get the following:

Das hermetische Caf&#65533;: Rock &amp; Wrestling 2010

However, when I display that in my webpage with UTF-8 encoding, it just shows a question mark.

Using the following code:

String title = StringEscapeUtils.escapeHtml(myTitle);

If I run the title through this website: http://tools.devshed.com/?option=com_mechtools&tool=27 I get the following output, which seems correct:

TITLE:

<title>Das hermetische Café: Rock &amp; Wrestling 2010</title>

BECOMES (which I was expecting the escapeHtml method to do):

<title>Das hermetische Caf&eacute;: Rock &amp; Wrestling 2010</title>

Any ideas? Thanks.

Upvotes: 23

Views: 53524

Answers (2)

nikhiljoshister

Reputation: 1

A decoder (charset) can also be specified for Java stream readers such as InputStreamReader, which has constructors that accept the charset of the bytes entering the stream. I agree with the answer erickson gave.
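
As a rough sketch of that idea, the snippet below passes an explicit charset to the InputStreamReader constructor. The class name ReaderCharsetDemo and the helper readAll are made up for illustration, and the in-memory byte array stands in for whatever stream the page is actually fetched from.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ReaderCharsetDemo {
    // Hypothetical helper: reads the whole stream, decoding it with the given charset.
    static String readAll(InputStream in, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the fetched page: a title encoded as ISO-8859-1.
        byte[] page = "<title>Das hermetische Caf\u00e9</title>"
                .getBytes(StandardCharsets.ISO_8859_1);
        String html = readAll(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1);
        System.out.println(html.contains("Caf\u00e9")); // prints true
    }
}
```

Passing the charset explicitly avoids relying on the platform default, which may differ from the page's encoding.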

Upvotes: 0

erickson

Reputation: 269657

U+FFFD (decimal 65533) is the "replacement character". When a decoder encounters an invalid sequence of bytes, it may (depending on its configuration) substitute � for the corrupt sequence and continue.
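
To see the substitution happen, here is a small sketch (the class name ReplacementCharDemo and the helper decodeAsUtf8 are invented for this example). It encodes "Café:" as ISO-8859-1 and then decodes those bytes as UTF-8; the String constructor replaces malformed input rather than throwing.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    // Decodes bytes as UTF-8; new String(...) substitutes U+FFFD for malformed sequences.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the "é" is the single byte 0xE9, which is not valid UTF-8 on its own.
        byte[] latin1 = "Caf\u00e9:".getBytes(StandardCharsets.ISO_8859_1);

        String wrong = decodeAsUtf8(latin1);
        String right = new String(latin1, StandardCharsets.ISO_8859_1);

        System.out.println((int) wrong.charAt(3));        // prints 65533
        System.out.println(right.equals("Caf\u00e9:"));   // prints true
    }
}
```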

One common reason for a "corrupt" sequence is that the wrong decoder has been applied. For example, the decoder might be UTF-8, but the page is actually encoded with ISO-8859-1 (the default if another is not specified in the content-type header or equivalent).

So, before you even pass the string to escapeHtml, the "é" has already been replaced with "�"; the method encodes this correctly.

The page in question uses ISO-8859-1 encoding. Make sure that you are using that decoder when converting the fetched resource to a String.
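
One way to pick the decoder, sketched under some assumptions: read the charset parameter from the Content-Type header value and fall back to ISO-8859-1 when none is given. The class CharsetFromHeader and its methods are hypothetical names, the regular expression is deliberately simple, and a charset declared only in an HTML meta tag is not handled here.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CharsetFromHeader {
    // Matches e.g. "text/html; charset=ISO-8859-1" or charset="utf-8".
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    // Returns the charset named in the header value, or ISO-8859-1 if absent or unknown.
    static Charset charsetOf(String contentType) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: use the fallback below.
                }
            }
        }
        return StandardCharsets.ISO_8859_1;
    }

    // Converts the fetched bytes to a String using the charset chosen above.
    static String decode(byte[] body, String contentType) {
        return new String(body, charsetOf(contentType));
    }

    public static void main(String[] args) {
        byte[] body = "Das hermetische Caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // No charset parameter in the header, so the ISO-8859-1 fallback applies.
        String title = decode(body, "text/html");
        System.out.println(title.equals("Das hermetische Caf\u00e9")); // prints true
    }
}
```

The resulting String can then be passed to escapeHtml, since the "é" is intact at that point.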

Upvotes: 55
