Reputation: 667
jsoup is adding some extra encoded values to my HTML when I parse it. I am trying to take care of it with:
String url = "http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html";
System.out.println("Fetching " + url + "...");
Document doc = Jsoup.connect(url).get();
//System.out.println(doc.html());
Document.OutputSettings settings = doc.outputSettings();
settings.prettyPrint(false);
settings.escapeMode(Entities.EscapeMode.base);
settings.charset("ASCII");
String html = doc.html();
System.out.println(html);
But the Entities class is not found for some reason, and the compiler gives an error. My imports are:
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
The original HTML is
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
</head>
<body>
<div style="background-image: url(http://aka-cdn-ns.adtech.de/rm/ads/23274/HPWomenLOFT_1381687318.jpg);background-repeat: no-repeat;-webkit-background-size: 1001px 2059px; height: 2059px; width: 1001px; text-align: center; margin: 0 auto;">
<div style="height:2058px; padding-left:0px; padding-top:36px;">
<iframe style="height:90px; width:728px;" />
</div>
</div>
</body>
</html>
The doc.html() call from jsoup gives this:
<!DOCTYPE html>
<html xmlns:og="http://opengraphprotocol.org/schema/" xmlns:fb="http://www.facebook.com/2008/fbml" xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en" class="SAF" id="global-header-light">
<head>
<style>
</style>
</head>
<body>
<div style="background-image: url(aol.jpeg); background-repeat: no-repeat;-webkit-background-size:90720;height:720; width:90; text-align: center; margin: 0 auto;">
<div style="height:450; width:100; padding-left:681px; padding-top:200px;">
<iframe style="height:1050px; width:300px;"></iframe> &lt;/div&gt; &lt;/div&gt; &lt;/body&gt; &lt;/html&gt;
</div>
</div>
</body>
</html>
Some encoded content has been added at the iframe element.
Please help.
Thanks Swaraj
Upvotes: 1
Views: 4298
Reputation: 17745
Actually jsoup is not adding the encoded stuff. Jsoup just adds the closing tags that seem to be missing. Let me explain.
First of all, jsoup tries to normalize your HTML. In your case that means it adds any closing tags that are missing. For example:
Document doc = Jsoup.parse("<div>test<span>test");
System.out.println(doc.html());
Output:
<html>
<head></head>
<body>
<div>
test
<span>test</span>
</div>
</body>
</html>
If you check the encoded stuff you will realize that they are closing tags.
&lt;/div&gt; = </div>
&lt;/div&gt; = </div>
&lt;/body&gt; = </body>
If you go to the site and press Ctrl+U (in Chrome), you will see the source that jsoup parses. Chrome colors the HTML tags that it recognizes as valid. For some odd reason it does not recognize the tags at the bottom (the same ones that appear with the escaped characters). jsoup has a problem with those closing tags for the same reason: it treats them as text rather than as closing tags, so it escapes them, and then it normalizes the HTML by adding the missing closing tags, as I explained earlier.
EDIT: I managed to replicate the behavior.
Document doc = Jsoup.parse("<iframe /><span>test</span>");
System.out.println(doc.html());
You can see the exact same behavior. The problem is the self-closing iframe. Writing it with an explicit closing tag fixes the problem:
Document doc = Jsoup.parse("<iframe></iframe><span>test</span>");
System.out.println(doc.html());
EDIT 2: If you just want to receive the HTML without building the Document object, you can do this (it needs import org.jsoup.Connection;):
Connection.Response response = Jsoup.connect("http://iqtestsites.adtech.de/pictelatest/custombkgd/StylelistDevil.html").execute();
System.out.println(response.body());
With the raw HTML in hand, you can find the self-closing iframe and replace it with the valid representation (or remove it completely), then parse the resulting string with Jsoup.parse().
This fixes the problem of the closing tags after the iframe not being recognized, because the markup will then be valid.
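The replacement step could look roughly like the sketch below. It uses only java.util.regex, not any jsoup API, and the names IframeFixer and expandSelfClosingIframes are made up for illustration. It assumes that no iframe attribute value contains a '>' character; a regex is a blunt tool for HTML, so treat this as a workaround for this particular page rather than a general solution.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not part of jsoup): rewrites <iframe ... /> as
// <iframe ...></iframe> so that an HTML parser sees an explicit closing tag.
final class IframeFixer {

    // Matches "<iframe", any attributes that contain no '>', then "/>".
    // Assumption: attribute values never contain a '>' character.
    private static final Pattern SELF_CLOSING_IFRAME =
            Pattern.compile("<iframe\\b([^>]*?)\\s*/>", Pattern.CASE_INSENSITIVE);

    static String expandSelfClosingIframes(String html) {
        // $1 re-inserts the captured attributes unchanged.
        return SELF_CLOSING_IFRAME.matcher(html).replaceAll("<iframe$1></iframe>");
    }

    public static void main(String[] args) {
        String raw = "<iframe style=\"height:90px; width:728px;\" /><span>test</span>";
        String fixed = expandSelfClosingIframes(raw);
        // prints: <iframe style="height:90px; width:728px;"></iframe><span>test</span>
        System.out.println(fixed);
        // The fixed string can then be handed to Jsoup.parse(fixed).
    }
}
```

In practice the input would be the string returned by body() on the Connection.Response from EDIT 2, and the output would go to Jsoup.parse().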
Upvotes: 4