Reputation: 21
My usecase: Get html-pages by jsoup and returns a w3c-DOM for further processing by XML-transformations:
...
org.jsoup.nodes.Document document = connection.get();
org.w3c.dom.Document dom = new W3CDom().fromJsoup(document);
...
Works well for most documents but for some it throws INVALID_CHARACTER_ERR without telling where.
It seems extremely difficult to find the error. I changed the code to first import the url to a String and then checking for bad characters by regexp. But that does not help for bad attributes (eg. without value) etc.
My current solution is to minimize the risk by removing elements by tag in the jsoup-document (head, img, script ...).
Is there a more elegant solution?
Upvotes: 2
Views: 449
Reputation: 43013
Solution found by OP in reply to nyname00:
Thank you very much; this solved the problem:
Whitelist whiteList = Whitelist.relaxed(); Cleaner cleaner = new Cleaner(whiteList); jsoupDom = cleaner.clean(jsoupDom);
"relaxed" in deed means relaxed developer...
Upvotes: 1
Reputation: 2566
Try setting the outputSettings
to 'XML' for your document:
document
.outputSettings()
.syntax(OutputSettings.Syntax.xml);
document
.outputSettings()
.charset("UTF-8");
This should ensure that the resulting XML is valid.
Upvotes: 1