user3319710

Reputation: 29

HTML files with no http-equiv meta tag and the charset may be other than UTF-8

We are using jsoup, and it is excellent, thanks.

We may get HTML files that have no http-equiv meta tag and whose charset may be something other than UTF-8. What is the best way to handle this? We could keep a list of encodings and try each one, but I am not sure how to tell programmatically that a given encoding is wrong. Would jsoup throw an IOException?

Upvotes: 0

Views: 330

Answers (1)

ollo

Reputation: 25340

Jsoup tries to determine the encoding from the Content-Type header or the http-equiv meta tag; if neither is present, it falls back to UTF-8. I'm not sure whether jsoup can do more for you here.

But you can try another approach:

Implement a class that reads the files for you and takes care of all encoding issues. Such a class should give you either a properly decoded string or at least the encoding that is used for your input.

(html input) --> [encoding class] --normalized encoding--> [jsoup] --> (whatever)   

Jsoup can now parse that input with a known encoding.
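
A minimal sketch of such an encoding class, using only the JDK. The class name `StrictDecoder` and its methods are hypothetical, not part of jsoup. It tries each candidate charset with a strict decoder that reports malformed bytes instead of silently replacing them, which is one way to tell programmatically that a candidate is wrong. Note that `new String(bytes, charset)` never fails on bad input (it substitutes replacement characters), so the strict decoder is what makes the check possible. This is only a heuristic: single-byte charsets such as ISO-8859-1 accept any byte sequence, so they belong at the end of the list.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.List;

// Hypothetical helper (not a jsoup API): picks the first charset that
// decodes the raw bytes without any malformed or unmappable sequence.
final class StrictDecoder {

    // Returns the first candidate that decodes cleanly, or null if none does.
    static Charset detect(byte[] bytes, List<Charset> candidates) {
        for (Charset cs : candidates) {
            try {
                cs.newDecoder()
                  .onMalformedInput(CodingErrorAction.REPORT)      // fail instead of replacing
                  .onUnmappableCharacter(CodingErrorAction.REPORT)
                  .decode(ByteBuffer.wrap(bytes));
                return cs;
            } catch (CharacterCodingException e) {
                // These bytes are not valid in this charset; try the next one.
            }
        }
        return null;
    }

    // Decodes with the detected charset, or with the fallback if detection fails.
    static String decode(byte[] bytes, List<Charset> candidates, Charset fallback) {
        Charset cs = detect(bytes, candidates);
        return new String(bytes, cs != null ? cs : fallback);
    }
}
```

The resulting string can then be handed to `Jsoup.parse(html)`, so jsoup never has to guess the charset. Put UTF-8 first in the candidate list, since non-UTF-8 text with accented characters is rarely valid UTF-8.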

I guess changing whatever generates the HTML is not possible, is it?


Upvotes: 0
