HTML not downloaded correctly

Question

I've been trying to download the source of the Google News RSS feed. It downloads correctly, except that the links are rendered strangely.

static String urlNotizie = "https://news.google.it/news/feeds?pz=1&cf=all&ned=it&hl=it&output=rss";
Document docHtml = Jsoup.connect(urlNotizie).get();
String html = docHtml.toString();
System.out.println(html);

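Output:

<html>
 <head></head>
 <body>
  <rss version="2.0">
   <channel>
    <generator>
     NFE/1.0
    </generator>
    <title>Prima pagina - Google News</title>
    <link />http://news.google.it/news?pz=1&ned=it&hl=it
    <language>
     it
    </language>
    <webmaster>
     news-feedback@google.com
    </webmaster>
    <copyright>
     &copy;2013 Google
    </copyright>
    [...]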

Using a URLConnection I'm able to output the correct source of the page. But when I parse it I get the same issue as above: it prints a list of empty <link /> elements. (Again, only with links; parsing other elements works fine.) URLConnection example:

        URL u = new URL(urlNotizie);
        URLConnection yc = u.openConnection();

        StringBuilder builder = new StringBuilder();
        BufferedReader reader = new BufferedReader(new InputStreamReader(
                yc.getInputStream()));
        String line;
        while ((line = reader.readLine()) != null) {
            builder.append(line);
            builder.append("\n");
        }
        String html = builder.toString();
        System.out.println("HTML " + html);

        Document doc = Jsoup.parse(html);

        Elements listaTitoli = doc.select("title");
        Elements listaCategorie = doc.select("category");
        Elements listaDescrizioni = doc.select("description");
        Elements listaUrl = doc.select("link");
        System.out.println(listaUrl);

BalusC · Accepted Answer

Jsoup is designed as an HTML parser, not as an XML (or RSS) parser.

The HTML <link> element is specified as not having any body. It would be invalid to have a <link> element with a body, as in your XML.

You can parse XML using Jsoup, but you need to explicitly tell it to switch to XML parsing mode.

Replace

Document docHtml = Jsoup.connect(urlNotizie).get();

by

Document docXml = Jsoup.connect(urlNotizie).parser(Parser.xmlParser()).get();
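
Because the feed is XML, any real XML parser will also return the link text intact. As a rough sketch of that idea, here is a version that uses only the JDK's built-in DOM parser instead of Jsoup. This is an alternative, not what the answer above proposes; the class name RssLinks, the helper extractLinks and the inline sample feed are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class RssLinks {

    // Hypothetical helper: returns the trimmed text of every <link> element in the XML.
    static List<String> extractLinks(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        // In XML, <link> is an ordinary element, so its body is kept as text content.
        NodeList nodes = doc.getElementsByTagName("link");
        List<String> links = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            links.add(nodes.item(i).getTextContent().trim());
        }
        return links;
    }

    public static void main(String[] args) throws Exception {
        // Tiny stand-in for the real feed (assumed sample, not the actual response).
        String rss = "<rss version=\"2.0\"><channel>"
                + "<title>Prima pagina - Google News</title>"
                + "<link>http://news.google.it/news?pz=1&amp;ned=it&amp;hl=it</link>"
                + "</channel></rss>";
        System.out.println(extractLinks(rss));
    }
}
```

For the second snippet in the question, which parses a string that is already in memory, Jsoup's own equivalent of the fix is the overload Jsoup.parse(html, "", Parser.xmlParser()).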
