Kirinkunhime

Reputation: 105

Quotation marks causing IllegalNameException when parsing HTML with JDOM2

Good evening everyone!

I'm trying to parse an HTML page in Java with JDOM2 to extract some information from it.

My code looks like this (the imports are listed here only so the snippet is complete):

import java.io.StringReader;
import org.cyberneko.html.parsers.DOMParser;
import org.jdom2.Document;
import org.jdom2.input.DOMBuilder;
import org.xml.sax.InputSource;

// The page has already been read into my String "string" at this point

// Wrap the HTML text so the parser can read it
InputSource is = new InputSource(new StringReader(string));

// Parse the HTML into a W3C DOM with NekoHTML
DOMParser parser = new DOMParser();
parser.parse(is);

// Convert the W3C DOM into a JDOM2 document
DOMBuilder builder = new DOMBuilder();
Document doc = builder.build(parser.getDocument());

This works fine for everything except one special case: when the page contains unescaped quotation marks inside an attribute value. Here is an example of what I mean:

<a href="LINK" title="Der "realismo mágico" und die Phantastische Literatur">Der "realismo mágico" und die Phantastische...</a>

So, after that wonderful tag I get the following error:

SEVERE: org.jdom2.IllegalNameException: The name "literatur"" is not legal for JDOM/XML attributes: XML name 'literatur"' cannot contain the character """.

So, now my question is: what are my options for dealing with this error? Is there maybe a feature in NekoHTML I can use for this (via "setFeature()"), or something within JDOM I could use?

If not: are there other libraries suitable for scraping websites that can cope with quotation marks inside an attribute value?

Thanks for your time!

Upvotes: 0

Views: 148

Answers (1)

Kirinkunhime

Reputation: 105

Okay, so I solved the problem as follows:

Since nothing else in my code depended on NekoHTML, I switched to JTidy as the parser, which does the job in this case.
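For anyone wondering why the original setup failed: judging by the error message, the inner quotes apparently ended the title value early, so a leftover fragment like literatur" was handed on as an attribute name, and a double quote is not allowed in an XML name. Below is a minimal sketch of that rule using only the JDK's built-in DOM implementation (not JDOM2, NekoHTML or JTidy). The class and method names are made up for illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;

// Hypothetical helper, for illustration only: asks the JDK's DOM
// implementation whether a string is acceptable as an attribute name.
class AttributeNameCheck {

    static boolean isLegalAttributeName(String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .newDocument();
            // createAttribute throws a DOMException (INVALID_CHARACTER_ERR)
            // when the name is not a valid XML name.
            doc.createAttribute(name);
            return true;
        } catch (DOMException e) {
            return false;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No DOM implementation available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isLegalAttributeName("title"));       // prints true
        System.out.println(isLegalAttributeName("literatur\"")); // prints false
    }
}
```

JDOM2 applies a comparable name check when it builds its attributes, which is where the IllegalNameException comes from. The fix therefore has to happen in the HTML parser, before the DOM reaches JDOM2.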

Question answered.

Upvotes: 1
