Reputation: 1363
I've been trying to download the source of the Google News RSS feed. It downloads correctly, except for the links, which come out mangled.
static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
Document docHtml = Jsoup.connect(urlNotizie).get();
String html = docHtml.toString();
System.out.println(html);
Output:
<html>
<head></head>
<body>
<rss version="2.0">
<channel>
<generator>
NFE/1.0
</generator>
<title>Prima pagina - Google News</title>
<link />http://news.google.it/news?pz=1&ned=it&hl=it
<language>
it
</language>
<webmaster>
[email protected]
</webmaster>
<copyright>
&copy;2013 Google
</copyright> [...]
Using a URLConnection, I can print the correct source of the page. But when I parse it I hit the same issue as above: selecting the links just prints a list of <link /> elements.
(Again, this happens only with links; parsing other elements works fine.) URLConnection example:
URL u = new URL(urlNotizie);
URLConnection yc = u.openConnection();
StringBuilder builder = new StringBuilder();
BufferedReader reader = new BufferedReader(new InputStreamReader(
        yc.getInputStream()));
String line;
while ((line = reader.readLine()) != null) {
    builder.append(line);
    builder.append("\n");
}
String html = builder.toString();
System.out.println("HTML " + html);
Document doc = Jsoup.parse(html);
Elements listaTitoli = doc.select("title");
Elements listaCategorie = doc.select("category");
Elements listaDescrizioni = doc.select("description");
Elements listaUrl = doc.select("link");
System.out.println(listaUrl);
Upvotes: 0
Views: 72
Reputation: 1109182
Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.
The HTML <link> element is specified as not having any body, so a <link> element with content, as in your XML, is invalid HTML. That is why Jsoup closes the element immediately and leaves the URL outside it.
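As a cross-check that does not depend on Jsoup, here is a minimal sketch using the JDK's built-in DOM parser (a different technique from the Jsoup calls in this thread). It shows that in XML the <link> element does carry its URL as text content. The class name, the helper method, and the hand-written sample feed are assumptions for illustration, not the real Google News response.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class RssLinkCheck {
    // Hypothetical sample that mimics the shape of the feed (not real data).
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<title>Prima pagina - Google News</title>"
        + "<link>http://news.google.it/news?pz=1&amp;ned=it&amp;hl=it</link>"
        + "</channel></rss>";

    // Returns the text content of every <link> element, parsed as XML.
    static List<String> linkTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList links = doc.getElementsByTagName("link");
            List<String> out = new ArrayList<>();
            for (int i = 0; i < links.getLength(); i++) {
                out.add(links.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE));
    }
}
```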
You can parse XML using Jsoup, but you need to explicitly tell it to switch to XML parsing mode.
Replace
Document docHtml = Jsoup.connect(urlNotizie).get();
with
Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
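The same switch should also apply to the second snippet in the question, where the feed is first read into a String and then passed to Jsoup.parse(html). A minimal sketch, assuming jsoup is on the classpath; the class name, the helper method, and the hand-written sample feed are hypothetical and stand in for the downloaded string:

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class JsoupXmlLinks {
    // Hypothetical sample that mimics the shape of the feed (not real data).
    static final String SAMPLE =
        "<rss version=\"2.0\"><channel>"
        + "<title>Prima pagina - Google News</title>"
        + "<link>http://news.google.it/news?pz=1&amp;ned=it&amp;hl=it</link>"
        + "</channel></rss>";

    // Parses the string in XML mode and collects the text of each <link>.
    static List<String> linkTexts(String xml) {
        // Second argument is the base URI (empty here);
        // Parser.xmlParser() turns off the HTML-specific element rules.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());
        List<String> out = new ArrayList<>();
        for (Element link : doc.select("link")) {
            out.add(link.text());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(linkTexts(SAMPLE));
    }
}
```

In the question's code this would mean replacing Jsoup.parse(html) with Jsoup.parse(html, "", Parser.xmlParser()) and leaving the select calls as they are.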
Upvotes: 1