Zoette

Reputation: 1271

Can't read XML document with Java

I'm trying to parse an XML file: a sitemap hosted on the web. I've tried many combinations without success. I'm sure I'm close, but I can't find anything that works.

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
System.out.println("XML = " + doc);

Output:

XML = [#document: null]

How come the output is [#document: null]?

The document (https://www.lavisducagou.nc/page-sitemap.xml) is indeed online.

Thanks in advance for your help.

Upvotes: 0

Views: 935

Answers (3)

Emre Savcı

Reputation: 3070

You need to iterate over the document and look up the XML elements you want. Here is a solution that reads the loc and lastmod nodes inside each url node.

package com.yourPackage;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

public class Main {
    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);

        // Parse the remote sitemap; try-with-resources closes the stream afterwards
        Document doc;
        try (InputStream in = new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream()) {
            doc = factory.newDocumentBuilder().parse(in);
        }
        doc.getDocumentElement().normalize();

        NodeList urlList = doc.getElementsByTagName("url");

        for (int i = 0; i < urlList.getLength(); i++) {
            Element url = (Element)urlList.item(i);

            Node loc = url.getElementsByTagName("loc").item(0);
            Node lastmod = url.getElementsByTagName("lastmod").item(0);

            System.out.println(loc.getTextContent());
            System.out.println(lastmod.getTextContent());
        }

    }
}

The output is:

https://www.lavisducagou.nc/
2018-07-14T11:30:25+11:00
https://www.lavisducagou.nc/sinscrire/
2018-05-03T16:58:35+11:00
https://www.lavisducagou.nc/se-connecter/
2018-05-03T18:02:07+11:00
https://www.lavisducagou.nc/mot-de-passe-oublie/
2018-05-03T20:33:08+11:00
https://www.lavisducagou.nc/compte/
2018-05-03T20:36:32+11:00
https://www.lavisducagou.nc/mon-profil/
2018-05-05T15:18:36+11:00
https://www.lavisducagou.nc/processus-de-paiement/
2018-05-07T15:23:39+11:00
https://www.lavisducagou.nc/paiement/
2018-05-07T23:44:51+11:00
https://www.lavisducagou.nc/historique-des-transactions/
2018-05-12T16:58:30+11:00
https://www.lavisducagou.nc/entreprise-standard/
2018-05-16T23:22:26+11:00
https://www.lavisducagou.nc/entreprise-premium/
2018-05-16T23:25:31+11:00
https://www.lavisducagou.nc/ajouter-une-entreprise/
2018-05-16T23:30:08+11:00
https://www.lavisducagou.nc/a-propos-de-nous/
2018-06-05T18:52:10+11:00
https://www.lavisducagou.nc/se-referencer/
2018-06-07T16:15:39+11:00
https://www.lavisducagou.nc/politique-de-confidentialite/
2018-06-15T09:15:11+11:00
https://www.lavisducagou.nc/donner-un-avis/
2018-06-16T10:55:24+11:00
https://www.lavisducagou.nc/conditions-dutilisation/
2018-06-19T16:39:44+11:00
https://www.lavisducagou.nc/annuaire-des-entreprises/
2018-06-19T20:51:22+11:00
https://www.lavisducagou.nc/pdf-generer/
2018-06-21T00:40:48+11:00
https://www.lavisducagou.nc/generer-pdf/
2018-06-21T00:51:22+11:00
https://www.lavisducagou.nc/contactez-nous/
2018-06-23T15:44:20+11:00
https://www.lavisducagou.nc/modifier-standard/
2018-06-23T20:04:01+11:00
https://www.lavisducagou.nc/pdf-generer-admin/
2018-06-30T19:19:01+11:00
https://www.lavisducagou.nc/conditions-generales-de-vente/
2018-07-02T15:19:51+11:00
https://www.lavisducagou.nc/modifier-standard-lentreprise/
2018-07-04T22:25:30+11:00
https://www.lavisducagou.nc/modifier-lentreprise/
2018-07-04T22:26:25+11:00
https://www.lavisducagou.nc/mentions-legales/
2018-07-27T16:08:01+11:00
https://www.lavisducagou.nc/jeu-concours/
2018-08-22T14:40:53+11:00
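
One caveat with the loop above: the sitemap protocol makes lastmod optional, so for a url entry without it item(0) returns null and getTextContent() throws a NullPointerException. Below is a minimal null-safe sketch. It parses an inline sample string instead of the remote file, and the class name, the childText helper and the example.com URLs are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SitemapNullSafe {
    // Illustrative sample: the second <url> deliberately has no <lastmod>.
    static final String SAMPLE =
            "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
            + "<url><loc>https://example.com/a</loc><lastmod>2018-07-14T11:30:25+11:00</lastmod></url>"
            + "<url><loc>https://example.com/b</loc></url>"
            + "</urlset>";

    // Hypothetical helper: text of the first matching descendant, or the fallback if there is none.
    static String childText(Element parent, String tagName, String fallback) {
        Node node = parent.getElementsByTagName(tagName).item(0);
        return node == null ? fallback : node.getTextContent().trim();
    }

    // Builds one "loc | lastmod" line per <url> element.
    static List<String> entries(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));

        List<String> result = new ArrayList<>();
        NodeList urls = doc.getElementsByTagName("url");
        for (int i = 0; i < urls.getLength(); i++) {
            Element url = (Element) urls.item(i);
            result.add(childText(url, "loc", "(no loc)") + " | " + childText(url, "lastmod", "(no lastmod)"));
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        entries(SAMPLE).forEach(System.out::println);
    }
}
```

Whether a missing lastmod should be skipped, defaulted or reported depends on what you do with the data; the fallback string here is only a placeholder.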

Upvotes: 1

Thomas Fritsch

Reputation: 10127

Actually, your XML document was parsed and loaded correctly. You were only misled by the rather unhelpful output of doc.toString() (which is called behind the scenes when evaluating "XML = " + doc).
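
If you want to see the actual markup instead of the toString() summary, one option is to serialize the DOM tree with an identity Transformer from javax.xml.transform. Below is a minimal sketch. It parses a small inline string rather than the remote sitemap, and the class name, method names and example.com URL are illustrative only.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class PrintDocument {
    // Parse an XML string into a DOM Document (stand-in for the remote sitemap).
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    // Serialize the whole tree back to text using an identity transform.
    static String toXmlString(Document doc) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<urlset><url><loc>https://example.com/</loc></url></urlset>");
        System.out.println("toString() = " + doc);              // short summary only
        System.out.println("serialized = " + toXmlString(doc)); // the actual markup
    }
}
```

This is handy for debugging, but for extracting values you still walk the tree as described next.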

You already know which XML tag names to expect (urlset, url, loc, lastmod) and how they are nested within each other.

[Image: XML structure]

With this knowledge, you just need to walk down the XML tree and extract the values you want, for example like this:

public static void main(String[] args) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());

    // Get the <urlset> root element
    Element urlsetElement = doc.getDocumentElement();

    // Get the list of <url> elements within the <urlset> element
    NodeList urlNodeList = urlsetElement.getElementsByTagName("url");

    for(int i = 0; i < urlNodeList.getLength(); i++) {
        // Get the <url> element
        Element urlElement = (Element) urlNodeList.item(i);

        // Get the <loc> element within the <url> element
        Element locElement = (Element) urlElement.getElementsByTagName("loc").item(0);
        // Print the text content of the <loc> element
        System.out.println("loc = " + locElement.getTextContent());

        // Get the <lastmod> element within the <url> element
        Element lastmodElement = (Element) urlElement.getElementsByTagName("lastmod").item(0);
        // Print the text content of the <lastmod> element
        System.out.println("lastmod = " + lastmodElement.getTextContent());
    }
}

The output looks like this:

loc = https://www.lavisducagou.nc/
lastmod = 2018-07-14T11:30:25+11:00
loc = https://www.lavisducagou.nc/sinscrire/
lastmod = 2018-05-03T16:58:35+11:00
loc = https://www.lavisducagou.nc/se-connecter/
lastmod = 2018-05-03T18:02:07+11:00
loc = https://www.lavisducagou.nc/mot-de-passe-oublie/
lastmod = 2018-05-03T20:33:08+11:00
loc = https://www.lavisducagou.nc/compte/
lastmod = 2018-05-03T20:36:32+11:00
...
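
A side note on setNamespaceAware(true): the sitemap elements belong to the http://www.sitemaps.org/schemas/sitemap/0.9 namespace. getElementsByTagName("url") matches on the qualified name, which works here because the file declares that namespace as the default, without a prefix. If a sitemap used a prefix instead, the plain lookup would find nothing, while getElementsByTagNameNS would still match. Below is a minimal sketch of the difference. The prefixed sample, the class name and the example.com URL are invented for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class SitemapNamespaceLookup {
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Illustrative sample that binds the sitemap namespace to the prefix "sm".
    static final String SAMPLE =
            "<sm:urlset xmlns:sm=\"" + SITEMAP_NS + "\">"
            + "<sm:url><sm:loc>https://example.com/</sm:loc></sm:url>"
            + "</sm:urlset>";

    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for the NS-aware lookup below
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        // Matches the qualified name, so "sm:url" is not found when asking for "url".
        System.out.println("by tag name:  " + doc.getElementsByTagName("url").getLength());
        // Matches namespace URI plus local name, whatever prefix the file uses.
        System.out.println("by namespace: " + doc.getElementsByTagNameNS(SITEMAP_NS, "url").getLength());
    }
}
```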

Upvotes: 1

Himanshu Bhardwaj

Reputation: 4113

What you are looking at is just the toString() implementation that com.sun.org.apache.xerces.internal.dom.DocumentImpl inherits from NodeImpl:

public String toString() {
    return "["+getNodeName()+": "+getNodeValue()+"]";
}

A document node has no node value, so that part prints as null. What you need to do is get the child nodes, iterate over them, and read the details you need.
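
To make the name/value distinction concrete, here is a small sketch that prints getNodeName() and getNodeValue() for a document node, an element node and a text node. It uses an inline XML string, and the class name and example.com URL are placeholders.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class NodeValueDemo {
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        return factory.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<urlset><url><loc>https://example.com/</loc></url></urlset>");
        Element loc = (Element) doc.getElementsByTagName("loc").item(0);
        Node text = loc.getFirstChild(); // the text node inside <loc>

        // Per the DOM specification, document and element nodes have a null node value;
        // only nodes such as text, comment or attribute nodes carry one.
        System.out.println(doc.getNodeName() + " -> " + doc.getNodeValue());
        System.out.println(loc.getNodeName() + " -> " + loc.getNodeValue());
        System.out.println(text.getNodeName() + " -> " + text.getNodeValue());
    }
}
```

This is also why the examples in this thread call getTextContent() on elements rather than getNodeValue().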

I can't access the URL from Java because of a firewall issue, but here is a small excerpt from the same file.

<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl"  href="//www.lavisducagou.nc/wp-content/plugins/wordpress-seo/css/main-sitemap.xsl"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.lavisducagou.nc/</loc>
    <lastmod>2018-07-14T11:30:25+11:00</lastmod>
  </url>
  <url>
    <loc>https://www.lavisducagou.nc/sinscrire/</loc>
    <lastmod>2018-05-03T16:58:35+11:00</lastmod>
  </url>
</urlset>

Here is your code extended with the next step:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
System.out.println("XML = " + doc);
NodeList nodeList = doc.getChildNodes();
for (int i = 0; i < nodeList.getLength(); i++) {
    System.out.println(nodeList.item(i).getNodeName());
}

Sample output:

XML = [#document: null]
xml-stylesheet
urlset

Upvotes: 1
