Reputation: 584
I am parsing an HTML document with jsoup, and the document contains some script elements as well. However, when I print the resulting doc.html(), instead of:
<script language="JavaScript"> <a href="http://www.company.com/index.htm" </a> </script>
I am getting:
<script language="JavaScript"> &lt;a href="http://www.company.com/index.htm" &lt;/a&gt; </script>
In the code, I do a manipulation like the following:
for (final Element src : doc.select("script")) {
    data = data.replace(someText, newText);
    src.text(data); // <==== as far as I can tell, this method escapes the text
}
I am using the UTF-8 charset.
How can I get the unescaped text directly? Thanks in advance!
Upvotes: 4
Views: 3682
Reputation: 17922
I ran into the same problem. StringEscapeUtils from Apache Commons Lang seems to do the trick.
String html = StringEscapeUtils.unescapeHtml4(document.html());
IMO it's not the best solution to this problem, since it unescapes every entity in the document and not just the ones inside the script, but it works for me.
Upvotes: 1
Reputation: 584
Thanks for all your help. We solved the problem by writing to the script's data node directly instead of calling text():
src.childNode(0).attr("data", data);
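For anyone who wants a fuller version of the line above, here is a rough sketch of the same idea using DataNode's typed accessors rather than attr("data", ...). It assumes jsoup is on the classpath and that the script body was parsed into DataNode children. The class name ScriptDataRewrite and the method rewriteScripts are made up for illustration, and someText / newText stand in for the values from the question.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.DataNode;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;

public class ScriptDataRewrite {

    // Hypothetical helper: replaces someText with newText inside every
    // <script> body and returns the serialized document.
    static String rewriteScripts(String html, String someText, String newText) {
        Document doc = Jsoup.parse(html);
        for (Element src : doc.select("script")) {
            for (Node child : src.childNodes()) {
                // Script contents are held in DataNode children, which are
                // written out as-is; only touch those.
                if (child instanceof DataNode) {
                    DataNode node = (DataNode) child;
                    String data = node.getWholeData();
                    node.setWholeData(data.replace(someText, newText));
                }
            }
        }
        return doc.html();
    }

    public static void main(String[] args) {
        String html = "<script language=\"JavaScript\">"
                + "var link = \"<a href='http://www.company.com/index.htm'>home</a>\";"
                + "</script>";
        System.out.println(rewriteScripts(html, "index.htm", "start.htm"));
    }
}
```

Checking instanceof DataNode avoids relying on childNode(0) being the data node, which may not hold if the script element is empty.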
Upvotes: 4