Reputation: 2616
I want to parse HTML text in Java.
I tried parsing HTML data with javax.swing.text.html.HTMLEditorKit, and it let me extract data from ordinary HTML. However, I have HTML data like this:
<span class="TitleServiceChange" >Service Change</span>
<span class="DateStyle">
&nbsp;Posted:&nbsp;12/16/2012&nbsp; 8:00PM
</span><br/><br/>
<P>
except that the tags are surrounded by '&lt;' and '&gt;' instead of '<' and '>'.
When I parse the text above, I get this error:
Parsing error: start.missing body ? ? at
Please suggest how I can resolve this. Thanks in advance.
Upvotes: 1
Views: 14448
Reputation: 13468
To unescape the full set of HTML entities in a string, you can use the Apache Commons Lang utility library.
Specifically, the StringEscapeUtils class provides an unescapeHtml4 method, among others.
Upvotes: 6
Reputation: 20885
HTML can be described in XML terms. XML has the concept of character data, which is made up of characters. Five characters have special meaning in XML: >, <, &, " and '. They are used to define elements and delimit attributes, so the parser doesn't treat them like normal characters. When you need to insert a literal < in an XML document (as I just did in this answer), you can use a character reference of the form &lt; so that the browser understands you are not starting a tag. The HTML 4 DTD defines 252 named entities, so it's infeasible to call replaceAll() for each of them to get a readable string.
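To illustrate why one replaceAll() per entity does not scale, here is a minimal sketch of a single-pass decoder, assuming Java 9 or later. The class name EntityDecoder, the method decode, and the deliberately tiny entity table are hypothetical and only for illustration; a real program would use a complete table or a library. The sketch handles named references from the table plus decimal and hexadecimal numeric references, and leaves anything it does not recognise untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class EntityDecoder {
    // Deliberately incomplete: HTML 4 defines 252 named entities.
    // "apos" is an XML entity and is included here only for convenience.
    private static final Map<String, String> NAMED = Map.of(
            "lt", "<", "gt", ">", "amp", "&",
            "quot", "\"", "apos", "'", "nbsp", "\u00A0");

    // Matches &name;  &#123;  and  &#x7B;
    private static final Pattern REFERENCE =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);");

    static String decode(String input) {
        Matcher m = REFERENCE.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.startsWith("#x") || body.startsWith("#X")) {
                replacement = fromCodePoint(body.substring(2), 16, m.group());
            } else if (body.startsWith("#")) {
                replacement = fromCodePoint(body.substring(1), 10, m.group());
            } else {
                // Unknown names are copied through unchanged.
                replacement = NAMED.getOrDefault(body, m.group());
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    private static String fromCodePoint(String digits, int radix, String original) {
        try {
            int codePoint = Integer.parseInt(digits, radix);
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original; // number too large to be a code point
        }
    }

    public static void main(String[] args) {
        System.out.println(decode("&lt;span class=&quot;DateStyle&quot;&gt;"));
    }
}
```

Because the input is scanned once, text that was escaped twice (for example &amp;lt;) is unescaped by exactly one level instead of collapsing all the way to <.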
It helps to understand how HTML works, so that you can think like a web browser when you design how your data is stored and rendered. Note that:
&lt;tag&gt;
has a very different meaning from
<tag>
The first is displayed as literal text, while the second is parsed as an element. Please elaborate on your question so that you can get help in the right direction.
Upvotes: 1
Reputation: 25950
If you can get the String representation of the data, replacing the escaped entities with the actual angle brackets could resolve your problem:
String htmlData = ...
htmlData = htmlData.replaceAll("&lt;", "<");
htmlData = htmlData.replaceAll("&gt;", ">");
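As a self-contained sketch of this approach: the class name BracketUnescaper and the method unescape are hypothetical, and the choice of entities (&lt;, &gt;, &quot;, &nbsp; and &amp;) is an assumption based on the sample in the question. It uses String.replace, which works on literal text, rather than the regex-based replaceAll, and it handles &amp; last so that input escaped twice is not unescaped twice.

```java
// Hypothetical helper, not part of any library.
final class BracketUnescaper {
    static String unescape(String htmlData) {
        return htmlData
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&nbsp;", "\u00A0") // non-breaking space
                .replace("&amp;", "&");      // last, to avoid double unescaping
    }

    public static void main(String[] args) {
        String escaped = "&lt;span class=&quot;DateStyle&quot;&gt;Posted&lt;/span&gt;";
        System.out.println(unescape(escaped));
    }
}
```

The unescaped string can then be handed to the HTML parser. This only covers the handful of entities listed; anything else in the input is left as it is.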
Upvotes: 3