simplysiby

Reputation: 584

Remove escaped text from JSOUP parsed HTML

I am parsing an HTML document with jsoup that contains some <script> elements as well. However, when I print the resulting doc.html(), instead of:

<script language="JavaScript"> <a href="http://www.company.com/index.htm" </a> </script> 

I am getting:

<script language="JavaScript"> &lt;a href=&quot;http://www.company.com/index.htm&quot; &lt;/a&gt; </script>

In the code, I do a manipulation like the following:

for (final Element src : doc.select("script")) {
    data = data.replace(someText, newText);
    src.text(data); // <==== I found that this method escapes the text
}

I am using the UTF-8 charset.

How can I get the unescaped text directly? Thanks in advance!

Upvotes: 4

Views: 3682

Answers (3)

Ben Weiss

Reputation: 17922

I ran into the same problem. StringEscapeUtils from Apache Commons Lang seems to do the trick.

String html = StringEscapeUtils.unescapeHtml4(document.html());

IMO it's not the best solution to this problem, but it works for me.
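One reason this may not be the best solution: unescaping the whole serialized document also decodes entities that were escaped on purpose outside the script element. The sketch below illustrates that with a hypothetical stand-in, unescapeBasic, which handles only the four entities from the question. It is not the Apache Commons implementation, and the class name and sample markup are assumptions made for illustration.

```java
// Hypothetical stand-in for a blanket unescape over serialized HTML.
// This is NOT StringEscapeUtils; it only decodes four common entities.
class BlanketUnescapeDemo {

    // "&amp;" is decoded last so that "&amp;lt;" becomes "&lt;" rather than "<".
    static String unescapeBasic(String html) {
        return html.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        // Assumed sample output: a paragraph with intentionally escaped text,
        // followed by the escaped script content from the question.
        String serialized =
            "<p>Use &lt;b&gt; for bold.</p>"
            + "<script> &lt;a href=&quot;http://www.company.com/index.htm&quot; &lt;/a&gt; </script>";

        // The script content is restored, but the paragraph's "&lt;b&gt;"
        // is also turned into a real <b> tag, which changes the page.
        System.out.println(unescapeBasic(serialized));
    }
}
```

In this sketch the script content comes back as wanted, but the paragraph text is turned into markup too, so a whole-document unescape is only safe when nothing else in the page relies on escaped entities.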

Upvotes: 1

simplysiby

Reputation: 584

Thanks for all your help. We solved the problem by setting the data attribute on the script element's first child node:

src.childNode(0).attr("data", data);

Upvotes: 4

Gabriele Petrioli

Reputation: 195982

Use the .html() method instead of .text():

src.html(data);

Upvotes: 0
