Reputation: 316
I'm reading an RSS feed whose description
tag contains HTML code rather than plain text. I need to extract some information from it, such as the first image that appears, but I can't, because none of the tags inside description
are parsed by Jsoup. I suppose this is due to the behaviour of the CDATA element.
I say "automatic way" in my question because I saw in another question published here that I could use .replace()
to remove the CDATA markers, but that doesn't seem an effective solution to me: I think it would work only for specific cases, not universally. So my question is: is there a way to make Jsoup parse this content without my replacing text by hand? Is that the only way that exists? Should I use another library?
For example, when I parse the RSS document, the description node contains this:
&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;
a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;
&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, según lo visto en un enigm&aacu
All the special characters ("<" and ">") are escaped because that is how CDATA works. The rest of the document is parsed correctly; this happens only with the CDATA content.
The code I use to access it:
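To illustrate why the markup comes back as text, here is a minimal sketch that uses only the JDK's built-in DOM parser instead of Jsoup. The feed string, the class name CdataDemo and the helper firstDescription are made up for this illustration and are not part of the original feed or code. It shows that a CDATA section is plain character data to an XML parser, so the description node has no img child and the embedded HTML has to be parsed in a second step.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class CdataDemo {
    // Hypothetical one-item feed, much shorter than the real one in the question.
    static final String SAMPLE_RSS =
        "<rss><channel><item><description><![CDATA["
        + "<a href='http://example.com/news'><img src='http://example.com/a.jpg' /></a>"
        + "]]></description></item></channel></rss>";

    // Parses the XML and returns the first <description> element.
    static Element firstDescription(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return (Element) doc.getElementsByTagName("description").item(0);
    }

    public static void main(String[] args) throws Exception {
        Element description = firstDescription(SAMPLE_RSS);
        // The CDATA content is character data, so there are no <img> element nodes.
        System.out.println(description.getElementsByTagName("img").getLength()); // prints 0
        // The embedded HTML is available only as a string.
        System.out.println(description.getTextContent());
    }
}
```

The string printed by the second statement is what a second, HTML-aware parse would have to work on.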
Document doc = Jsoup.connect("http://www.3djuegos.com/universo/rss/rss.php?plats=1-2-3-4-5-6-7-34&tipos=noticia-analisis-avance-video-imagenes-demo&fotos=peques&limit=20").get();
System.out.println(doc.html()); // Shows the document, parsed correctly.
Elements nodes = doc.getElementsByTag("item"); // Access the news items
for (int i = 0; i < nodes.size(); i++) { // Loop over all news items
    // Description node
    Element descriptionNode = nodes.get(i).getElementsByTag("description").get(0);
    // Prints the content of the description tag; this is where the HTML tags come out escaped
    System.out.println(descriptionNode.html());
    // Access the first image. This fails because the description text is escaped,
    // so Jsoup sees it as text and has no img node to return
    Element imageNode = descriptionNode.getElementsByTag("img").get(0);
}
Edit: I use doc.outputSettings().escapeMode(EscapeMode.xhtml)
but I suppose it doesn't affect the CDATA content.
Edit2: As a workaround I use the library class org.apache.commons.lang3.StringEscapeUtils
which lets me unescape the HTML, but I'm still wondering whether Jsoup already has something for this scenario.
Upvotes: 1
Views: 3913
Reputation: 8509
You could use the text()
method to get the unescaped value. That means that if an element has a value like &lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;
then calling element.text()
returns <table width='100%' border='0' cellspacing='0' cellpadding='4'>
. You can then parse that fragment again to get whatever you want from it. E.g.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class Sample {
    public static void main(String[] args) throws Exception {
        // The description holds escaped HTML, as it appears after the feed is parsed
        String html = "<description>"
            + "&lt;table width='100%' border='0' cellspacing='0' cellpadding='4'&gt;&lt;tr&gt;&lt;td align='left' width='10'&gt;&lt;"
            + "a href='http://www.3djuegos.com/noticia/145062/0/bioware-nuevo-juego-ip/video-gamescom/trailer/'&gt;&lt;img src='http://i11c.3djuegos.com/juegos/7332/dragon_age_iii/fotos/noticias/dragon_age_iii-2583054.jpg' border='0' width='70' height='52' /&gt;"
            + "&lt;/a&gt;&lt;/td&gt;&lt;td align='left' valign='top'&gt;Parece ser una nueva licencia creativa, según lo visto en un enigm&aacu"
            + "</description>";
        Document doc = Jsoup.parse(html);
        for (Element desc : doc.select("description")) {
            // text() returns the unescaped markup as a plain string
            String unescapedHtml = desc.text();
            // Parse that string a second time so the img becomes a real node
            String src = Jsoup.parse(unescapedHtml).select("img").first().attr("src");
            System.out.println(src);
        }
        System.out.println("Done");
    }
}
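If some items have no image, first() returns null and the chained attr("src") call throws. The following is a hedged sketch of the same text()-then-reparse idea with a null check. It also makes one substitution that is not in the code above: it reads the feed with Jsoup's XML parser (Parser.xmlParser()) instead of the default HTML parser, on the assumption that this suits an RSS document better. The feed string, the class name FirstImage and the helper firstImageSrc are invented for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class FirstImage {
    // Returns the src of the first <img> in the description's embedded HTML,
    // or null when the description contains no image.
    static String firstImageSrc(Element description) {
        // text() gives the embedded HTML as a string; parse it a second time.
        Document fragment = Jsoup.parse(description.text());
        Element img = fragment.select("img").first();
        return img == null ? null : img.attr("src");
    }

    public static void main(String[] args) {
        // Hypothetical two-item feed: one description with an image, one without.
        String rss = "<rss><channel>"
            + "<item><description><![CDATA[<a href='http://example.com/n'>"
            + "<img src='http://example.com/a.jpg' /></a> Some text]]></description></item>"
            + "<item><description><![CDATA[No image here]]></description></item>"
            + "</channel></rss>";
        // Assumption: the XML parser is used here instead of the default HTML parser.
        Document feed = Jsoup.parse(rss, "", Parser.xmlParser());
        for (Element item : feed.select("item")) {
            Element description = item.select("description").first();
            System.out.println(firstImageSrc(description));
        }
    }
}
```

Returning null keeps the sketch short; the caller decides whether to skip the item or fall back to a placeholder image.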
Upvotes: 4