Angelo Tricarico

Reputation: 1363

HTML not downloaded correctly

I've been trying to download the source of the Google News RSS feed. It downloads correctly, except for the links, which come out mangled.

static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
Document docHtml = Jsoup.connect(urlNotizie).get();
String html = docHtml.toString();
System.out.println(html);

Output:

<html>
 <head></head>
 <body>
  <rss version="2.0">
   <channel>
    <generator>
     NFE/1.0
    </generator>
    <title>Prima pagina - Google News</title>
    <link />http://news.google.it/news?pz=1&amp;ned=it&amp;hl=it
    <language>
     it
    </language>
    <webmaster>
     [email protected]
    </webmaster>
    <copyright>
     &amp;copy;2013 Google
    </copyright> [...]

Using a URLConnection, I can print the correct source of the page. But when I parse it, I hit the same issue as above: selecting the links gives me a list of empty <link /> elements. (Again, this only affects links; parsing everything else works fine.) URLConnection example:

        URL u = new URL(urlNotizie);
        URLConnection yc = u.openConnection();

        StringBuilder builder = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                yc.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
        String html = builder.toString();
        System.out.println("HTML " + html);

        Document doc = Jsoup.parse(html);

        Elements listaTitoli = doc.select("title");
        Elements listaCategorie = doc.select("category");
        Elements listaDescrizioni = doc.select("description");
        Elements listaUrl = doc.select("link");
        System.out.println(listaUrl);

Upvotes: 0

Views: 72

Answers (1)

BalusC

Reputation: 1109182

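Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.

In HTML, <link> is specified as an element without any body. A <link> element with a body, as in your XML, is invalid HTML, so the HTML parser closes it immediately and leaves the URL behind as loose text. That is the <link />http://... you see in your output.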

You can parse XML using Jsoup, but you need to explicitly tell it to switch to XML parsing mode.

Replace

Document docHtml = Jsoup.connect(urlNotizie).get();

with

Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
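
The same switch should also work for the URLConnection variant in the question, where the feed is already in a String. Below is a minimal sketch, assuming jsoup is on the classpath. The class name RssLinkDemo, the helper extractLinks and the two-item sample feed are made up for illustration; they are not taken from the real Google News feed.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

import java.util.ArrayList;
import java.util.List;

public class RssLinkDemo {

    // Hypothetical sample feed, standing in for the downloaded RSS source.
    static final String SAMPLE_RSS =
            "<rss version=\"2.0\"><channel>"
            + "<title>Sample feed</title>"
            + "<item><title>One</title><link>http://example.com/one</link></item>"
            + "<item><title>Two</title><link>http://example.com/two</link></item>"
            + "</channel></rss>";

    // Parses the given XML in jsoup's XML mode and returns the text of
    // every <link> that is a direct child of an <item>.
    static List<String> extractLinks(String xml) {
        // The third argument selects the XML parser instead of the default
        // HTML parser; the empty string is the base URI.
        Document doc = Jsoup.parse(xml, "", Parser.xmlParser());

        List<String> links = new ArrayList<>();
        for (Element link : doc.select("item > link")) {
            links.add(link.text());
        }
        return links;
    }

    public static void main(String[] args) {
        for (String url : extractLinks(SAMPLE_RSS)) {
            System.out.println(url);
        }
    }
}
```

In the question's code this would mean replacing Jsoup.parse(html) with Jsoup.parse(html, "", Parser.xmlParser()); the select calls can stay as they are.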

Upvotes: 1
