sanity

Reputation: 35782

Why aren't UTF-8 characters being rendered correctly in this web page (generated with JSoup)?

I'm having trouble dealing with charsets while parsing and rendering a page using the JSoup library. Here is an example of the page it renders:

http://dl.dropbox.com/u/13093/charset-problem.html

As you can see, where there should be ' characters, ? is being rendered instead (even when you view the source).

This page is generated by downloading a web page, parsing it with JSoup, and then re-rendering it after making some structural changes.

I'm downloading the page as follows:

final Document inputDoc = Jsoup.connect(sourceURL.toString()).get();

When I create the output document I do so as follows:

outputDoc.outputSettings().charset(Charset.forName("UTF-8"));
outputDoc.head().appendElement("meta").attr("charset", "UTF-8");
outputDoc.head().appendElement("meta").attr("http-equiv", "Content-Type")
            .attr("content", "text/html; charset=UTF-8");

Can anyone offer suggestions as to what I'm doing wrong?

Edit: note that the source page is http://blog.locut.us/ and, as you'll see, it appears to render correctly.

Upvotes: 1

Views: 1435

Answers (2)

BalusC

Reputation: 1109142

The question marks are typical whenever you write characters to the response's output stream that are not covered by the response's character encoding. You seem to be relying on the platform default character encoding when serving the response. The Content-Type response header of your site confirms this: it has no charset parameter.

Assuming that you're using a servlet to serve the modified HTML, you should call HttpServletResponse#setCharacterEncoding() to set the character encoding before writing the modified HTML out (and before calling getWriter(), since the call has no effect afterwards).

response.setCharacterEncoding("UTF-8");
response.getWriter().write(html);
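
To see where the question marks come from, here is a minimal sketch using only the JDK. It assumes the apostrophes on the source page are typographic ones (U+2019) and that the platform default behaves like ISO-8859-1; neither is confirmed by the question. The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class EncodingDemo {
    // Encodes the text to bytes with the given charset and decodes it back
    // with the same charset. Characters the charset cannot represent are
    // replaced during encoding (with '?' for ISO-8859-1).
    static String roundTrip(String text, Charset charset) {
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: the page contains U+2019 (right single quotation mark).
        String text = "it\u2019s";

        // ISO-8859-1 has no mapping for U+2019, so it is lost on output.
        System.out.println(roundTrip(text, StandardCharsets.ISO_8859_1)); // it?s

        // UTF-8 can encode every Unicode character, so the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The replacement happens at encoding time, which is why the ? also shows up when you view the page source: the original character never reaches the browser.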

Upvotes: 4

Leonard Brünings

Reputation: 13242

The problem is most likely in reading the input page; you need to use the correct encoding for the source too.
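
As a rough illustration of what a wrong input encoding would look like, here is a JDK-only sketch. It assumes the source bytes are UTF-8 and contain a typographic apostrophe (U+2019); the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DecodingDemo {
    // Decodes raw bytes using the given charset.
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        // Assumption: the server sends UTF-8 bytes containing U+2019.
        byte[] raw = "it\u2019s".getBytes(StandardCharsets.UTF_8);

        // Correct charset: the apostrophe is one character, 4 chars in total.
        System.out.println(decode(raw, StandardCharsets.UTF_8).length()); // 4

        // Wrong charset: each of the 3 UTF-8 bytes becomes its own character.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).length()); // 6
    }
}
```

Note that a wrong input charset produces several garbage characters per apostrophe rather than a single ?, so comparing the two symptoms can help narrow down which side is at fault. If the input does turn out to be the issue, jsoup's Jsoup.parse(InputStream in, String charsetName, String baseUri) overload lets you state the charset explicitly instead of relying on detection.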

Upvotes: 0
