Reputation: 2616

Parse HTML data in Java including &lt and &gt tags?

I want to parse HTML text in Java.

I have tried parsing HTML with javax.swing.text.html.HTMLEditorKit, and it let me extract data from HTML. However, some of my HTML data looks like this:

&lt;span class="TitleServiceChange" &gt;Service Change&lt;/span&gt;
                    &lt;span class="DateStyle"&gt;
                    &amp;nbsp;Posted:&amp;nbsp;12/16/2012&amp;nbsp; 8:00PM
                    &lt;/span&gt;&lt;br/&gt;&lt;br/&gt;
                  &lt;P&gt;

where '&lt;' and '&gt;' appear instead of '<' and '>'.

While parsing the above text, I get this error:

Parsing error: start.missing body ? ? at

Please suggest how I can resolve this. Thanks in advance.

Upvotes: 1

Answers (3)

Tomas Narros

Reputation: 13468

To unescape the full set of escaped characters in a string, you could use the Apache Commons Lang utility library.

Specifically, the StringEscapeUtils class provides the unescapeHtml4 method, among others.

Upvotes: 6

Raffaele

Reputation: 20885

HTML can be described in XML terms. XML has the concept of character data, which is made up of characters. Five characters have special meaning in XML: >, <, &, " and '. They delimit tags, attribute values, and character references, so the parser doesn't treat them as ordinary characters. When you need a literal < in an XML document (as I just did in this answer), you can write the character reference &lt;, so that the browser understands you do not intend to start a tag. The HTML 4 DTD defines 252 named entities, so it's infeasible to chain replaceAll() calls to get a readable string.
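To illustrate the point about character references, here is a minimal sketch of a single-pass decoder using only the standard library (Java 9+). The class name EntityDecoder and its tiny entity table are hypothetical and cover only a handful of the 252 named entities, so this is an illustration rather than a replacement for a complete library.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: decodes a small subset of HTML character references in one pass. */
final class EntityDecoder {

    // Assumption: only a few named entities are listed; a real decoder needs the full table.
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<", "gt", ">", "amp", "&",
            "quot", "\"", "apos", "'", "nbsp", "\u00A0");

    // Matches &name; as well as decimal (&#60;) and hexadecimal (&#x3C;) references.
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = REFERENCE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.charAt(0) == '#') {
                boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
                int codePoint = hex
                        ? Integer.parseInt(body.substring(2), 16)
                        : Integer.parseInt(body.substring(1));
                // Out-of-range references are left exactly as written.
                replacement = Character.isValidCodePoint(codePoint)
                        ? new String(Character.toChars(codePoint))
                        : m.group();
            } else {
                // Unknown names are left exactly as written.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Because each reference is decoded exactly once, EntityDecoder.decode("&amp;lt;") returns the text &lt; whereas chained replacements that handle &amp; first would turn the same input into <.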

It helps to understand how HTML works, so that you can think like a web browser when you design how your data is stored and rendered. Note that:

&lt;tag&gt;

has a very different meaning than

<tag>

So you should expand your question to get help in the right direction.

Upvotes: 1

Juvanis

Reputation: 25950

If you can get the data as a String, replacing the escaped sequences with the actual characters could resolve your problem:

String htmlData = ...

// replace() matches literal text, so no regular expression is needed
htmlData = htmlData.replace("&lt;", "<");
htmlData = htmlData.replace("&gt;", ">");
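As a rough end-to-end sketch of this approach, the snippet below undoes one level of escaping and then feeds the result to the Swing HTML parser that the question mentions. The class and method names (EscapedHtmlText, unescapeMarkup, extractText) are hypothetical, and it assumes the input was escaped exactly once, as the &amp;nbsp; sequences in the question suggest.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Hypothetical helper: undoes one level of escaping, then collects the text nodes. */
final class EscapedHtmlText {

    // Assumption: the markup was escaped exactly once.
    static String unescapeMarkup(String escaped) {
        return escaped.replace("&lt;", "<")
                      .replace("&gt;", ">")
                      .replace("&quot;", "\"")
                      // Done last, so "&amp;nbsp;" becomes "&nbsp;" and "&amp;lt;" becomes "&lt;".
                      .replace("&amp;", "&");
    }

    // Collects every text node reported by the Swing parser, one per line.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                text.append(data).append('\n');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling EscapedHtmlText.extractText(EscapedHtmlText.unescapeMarkup(htmlData)) would then hand real tags to the parser, which also resolves the remaining &nbsp; references itself.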

Upvotes: 3
