Reputation: 21

jsoup to w3c-document: INVALID_CHARACTER_ERR

My use case: fetch HTML pages with jsoup and return a W3C DOM for further processing by XML transformations:

...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...

This works well for most documents, but for some it throws INVALID_CHARACTER_ERR without saying where.

The error seems extremely difficult to locate. I changed the code to first fetch the URL into a String and then check for bad characters with a regexp, but that does not help with bad attributes (e.g. attributes without a value) and similar problems.
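One way to narrow down the culprit, assuming the exception comes from an element or attribute name that is not a legal XML name (a typical trigger for INVALID_CHARACTER_ERR in the W3C DOM API), is to test candidate names against the JDK's own DOM implementation. The sketch below uses only the JDK; the class name XmlNameCheck and the method isLegalDomName are hypothetical helpers for illustration, not part of jsoup.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper (not part of jsoup): asks the JDK's DOM implementation
// whether it would accept a given element or attribute name.
class XmlNameCheck {

    // Creates an empty W3C document used only for probing names.
    private static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    // Returns false if creating an element with this name raises
    // INVALID_CHARACTER_ERR; any other DOMException is rethrown.
    static boolean isLegalDomName(String name) {
        Document probe = newEmptyDocument();
        try {
            probe.createElement(name);
            return true;
        } catch (DOMException e) {
            if (e.code == DOMException.INVALID_CHARACTER_ERR) {
                return false;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Sample names: two ordinary ones and two that are not legal XML names.
        String[] names = {"div", "data-id", "1col", "on\"click"};
        for (String name : names) {
            System.out.println(name + " legal: " + isLegalDomName(name));
        }
    }
}
```

Running such a check over the tag names and attribute keys of the jsoup document before the conversion should at least report which name is rejected, instead of failing without a location.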

My current workaround is to reduce the risk by removing elements by tag from the jsoup document (head, img, script, ...).

Is there a more elegant solution?

Upvotes: 2

Answers (2)

Stephan

Reputation: 43013

Solution found by OP in reply to nyname00:

Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed();
Cleaner cleaner = new Cleaner(whiteList);
jsoupDom = cleaner.clean(jsoupDom);
"relaxed" indeed means a relaxed developer...

Upvotes: 1

nyname00

Reputation: 2566

Try setting the output syntax to XML in the document's outputSettings:

// OutputSettings is the nested class org.jsoup.nodes.Document.OutputSettings
document
    .outputSettings()
    .syntax(OutputSettings.Syntax.xml)
    .charset("UTF-8");

This should ensure that the resulting XML is valid.

Upvotes: 1
