Pascal Mathys

Reputation: 609

Strange encoding behaviour with jsoup

I extract some information from the HTML source code of different pages with jsoup. Most of them are UTF-8 encoded. One of them is encoded with ISO-8859-1, which leads to a strange error (in my opinion).

The page that contains the error is: http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html

I read the needed String with the following piece of code:

Document doc = Jsoup.connect("http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html").userAgent("Mozilla").get();
String title = doc.getElementsByClass("products_name").first().text();

The problem is the dash in the string "HD Armbanduhr aus Metall 4GB Wasserdicht 1280X960 – 5 Megapixels". Normal umlauts like öäü are read correctly. Only this single character, which is not written in the page source as the entity "&#45;", causes the problem.

I tried to override the (correctly set) page-encoding with out.outputSettings().charset("ISO-8859-1") but that didn't help either.

Next, I tried to change the encoding of the string manually with the Charset class, from and to UTF-8 and ISO-8859-1. Also no luck.

Does anyone have a tip on what I can try to get the correct character after parsing the HTML document with jsoup?

Thanks

Upvotes: 2

Views: 9168

Answers (1)

BalusC

Reputation: 1108642

This is a mistake of the website itself. There are actually three mistakes:

  1. The page is served without any charset in the HTTP Content-Type response header. There's ISO-8859-1 in the HTML meta tag, but this is ignored when the page is served over HTTP! The average web browser will either try smart detection or use the platform default encoding to decode the webpage, which is CP1252 on Windows machines.

  2. The <meta> tag pretends that the content is ISO-8859-1 encoded, but the actual character (U+2013 EN DASH) is not covered by that charset at all. It is however covered by the CP1252 charset, as byte 0x96.

  3. According to the webpage source code, the product name uses the literal character instead of the HTML entity &ndash; as spotted elsewhere on the same webpage.
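
The charset difference in point 2 can be checked with the JDK alone. The following is a minimal sketch; the class name DashDecodingDemo and the helper decodeByte are made up for illustration and are not part of jsoup.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DashDecodingDemo {

    // Decodes a single byte with the given charset and returns the resulting code point.
    public static int decodeByte(int b, Charset charset) {
        String s = new String(new byte[] { (byte) b }, charset);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // In ISO-8859-1, byte 0x96 is an invisible C1 control character.
        System.out.printf("ISO-8859-1: U+%04X%n", decodeByte(0x96, StandardCharsets.ISO_8859_1));
        // In CP1252, the same byte is the en dash.
        System.out.printf("CP1252:     U+%04X%n", decodeByte(0x96, cp1252));
    }
}
```

Running it prints U+0096 for ISO-8859-1 and U+2013 for CP1252, which is why the dash is lost when the declared charset is taken at face value.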

Jsoup can fix many badly developed webpages transparently, but this one is beyond what Jsoup can repair on its own. You need to read the page in manually and then feed it to Jsoup as CP1252.

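String url = "http://www.gudi.ch/armbanduhr-metall-wasserdicht-1280x960-megapixels-p-560.html";
// Open the stream ourselves and close it when done.
try (InputStream input = new URL(url).openStream()) {
    // Pass the charset explicitly so Jsoup decodes the bytes as CP1252 instead of detecting it.
    Document doc = Jsoup.parse(input, "CP1252", url);
    String title = doc.select(".products_name").first().text();
    // ...
}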
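
If re-fetching the page is not an option, a string that was already decoded as ISO-8859-1 can in principle be reinterpreted afterwards, because ISO-8859-1 maps every byte to exactly one character and back. This is only a sketch under that assumption: it presumes the control character U+0096 survived parsing and text extraction unchanged, which has not been verified against Jsoup here. The class name Latin1Repair and the method reinterpretAsCp1252 are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1Repair {

    private static final Charset CP1252 = Charset.forName("windows-1252");

    // Turns the text back into its original bytes via ISO-8859-1,
    // then decodes those bytes again as CP1252.
    // Characters outside ISO-8859-1 cannot be round-tripped and become '?'.
    public static String reinterpretAsCp1252(String decodedAsLatin1) {
        byte[] raw = decodedAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, CP1252);
    }

    public static void main(String[] args) {
        // \u0096 stands for byte 0x96 mis-decoded as ISO-8859-1.
        String broken = "1280X960 \u0096 5 Megapixels";
        String fixed = reinterpretAsCp1252(broken);
        System.out.println(fixed.indexOf('\u2013') >= 0); // true if the en dash was recovered
    }
}
```

Umlauts such as öäü have the same byte values in both charsets, so they pass through this conversion unchanged.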

Upvotes: 7
