Deepu

Reputation: 2616

Parse HTML data in Java including &lt; and &gt; entities?

I want to parse HTML text in Java.

I have tried parsing HTML data with javax.swing.text.html.HTMLEditorKit, and it helped me get data out of the HTML. However, I have HTML data like this:

<span class="TitleServiceChange" >Service Change</span>
                    <span class="DateStyle">
                     Posted: 12/16/2012  8:00PM
                    </span><br/><br/>
                  <P>

but with '&lt;' and '&gt;' in place of '<' and '>'.

While parsing the above text, I get this error:

Parsing error: start.missing body ? ? at

Please suggest how I can resolve this problem. Thanks in advance.

Upvotes: 1

Views: 14448

Answers (3)

Tomas Narros

Reputation: 13468

To unescape the full set of escaped characters in a string, you can use the Apache Commons Lang utility library.

Specifically, its StringEscapeUtils class provides the unescapeHtml4 method, among others.

Upvotes: 6

Raffaele

Reputation: 20885

HTML can be described in XML terms. XML has the concept of character data, which is made up of characters. Five characters have special meaning in XML: >, <, &, " and '. They are used to define elements and delimit attributes, so the parser doesn't treat them as normal characters. When you need to insert a literal < in an XML document (as I just did in this answer), you can use a character reference of the form &lt;, so that the browser understands that you do not intend to start a tag. The HTML 4 DTD defines 252 named entities, so chaining replaceAll() calls to get a readable string is impractical.

You'd better understand how HTML works, so that you think like a web browser when you design how your data is stored and rendered. Note that:
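As a rough illustration of what decoding character references involves, here is a minimal sketch that handles only the five XML predefined entities plus numeric references, in a single pass over the input. The class name EntityDecoder and its decode method are hypothetical, and the sketch assumes every reference ends with a semicolon; it does not cover the rest of the HTML 4 entity set.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EntityDecoder {
    // Matches &name;, decimal &#123; and hexadecimal &#x1F; references.
    private static final Pattern REF =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    public static String decode(String input) {
        Matcher m = REF.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Returns the decoded text, or the original reference if it is not recognised.
    private static String resolve(String body, String original) {
        try {
            if (body.startsWith("#x") || body.startsWith("#X")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(2), 16)));
            }
            if (body.startsWith("#")) {
                return new String(Character.toChars(Integer.parseInt(body.substring(1))));
            }
        } catch (IllegalArgumentException e) {
            // Out-of-range or invalid code point: leave the reference untouched.
            return original;
        }
        switch (body) {
            case "lt":   return "<";
            case "gt":   return ">";
            case "amp":  return "&";
            case "quot": return "\"";
            case "apos": return "'";
            default:     return original; // e.g. &nbsp; is not handled in this sketch
        }
    }
}
```

Because each reference is decoded exactly once, an input such as &amp;lt; becomes the text &lt; rather than <, which is the behaviour a browser would show.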

&lt;tag&gt;

has a very different meaning than

<tag>

So you should elaborate on your question to get help in the right direction.

Upvotes: 1

Juvanis

Reputation: 25950

If you can get the String representation of the data, replacing the escaped sequences with the actual tag characters could resolve your problem:

String htmlData = ...

htmlData = htmlData.replaceAll("&lt;", "<");
htmlData = htmlData.replaceAll("&gt;", ">");
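To tie this back to the question, the following sketch restores the markup and then feeds it to the Swing HTML parser to collect the text content. The class name EscapedHtmlText and the method extractText are hypothetical, and handling &quot; and &amp; as well is an assumption about what the escaped data may contain.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class EscapedHtmlText {
    public static String extractText(String escaped) {
        // Restore the markup; &amp; goes last so "&amp;lt;" does not end up as "<".
        String html = escaped.replace("&lt;", "<")
                             .replace("&gt;", ">")
                             .replace("&quot;", "\"")
                             .replace("&amp;", "&");

        final StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Join the text chunks reported by the parser with single spaces.
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(new String(data).trim());
            }
        };

        try {
            // true = ignore any charset declaration found in the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }
}
```

Calling EscapedHtmlText.extractText on the escaped form of the span from the question should hand the parser ordinary tags, so the callback receives the text between them instead of one long run of literal characters.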

Upvotes: 3
