Reputation: 1271
I'm trying to parse an XML file: a sitemap that is available on the web. I've tried many combinations without success. I'm sure I'm close, but I can't find anything that works.
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
System.out.println("XML = " + doc);
Output:
XML = [#document: null]
How come the output is [#document: null]?
The document (https://www.lavisducagou.nc/page-sitemap.xml) is indeed online.
Thanks in advance for your help.
Upvotes: 0
Views: 935
Reputation: 3070
You need to iterate over the document and find your XML elements. Here is a solution that gets the loc and lastmod nodes inside each url node.
package com.yourPackage;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import java.io.IOException;
import java.net.URL;
public class Main {
    public static void main(String[] args) throws ParserConfigurationException, IOException, SAXException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
        doc.getDocumentElement().normalize();
        // Every <url> element in the sitemap
        NodeList urlList = doc.getElementsByTagName("url");
        for (int i = 0; i < urlList.getLength(); i++) {
            Element url = (Element) urlList.item(i);
            // First <loc> and <lastmod> inside this <url>
            Node loc = url.getElementsByTagName("loc").item(0);
            Node lastmod = url.getElementsByTagName("lastmod").item(0);
            System.out.println(loc.getTextContent());
            System.out.println(lastmod.getTextContent());
        }
    }
}
The output is:
https://www.lavisducagou.nc/
2018-07-14T11:30:25+11:00
https://www.lavisducagou.nc/sinscrire/
2018-05-03T16:58:35+11:00
https://www.lavisducagou.nc/se-connecter/
2018-05-03T18:02:07+11:00
https://www.lavisducagou.nc/mot-de-passe-oublie/
2018-05-03T20:33:08+11:00
https://www.lavisducagou.nc/compte/
2018-05-03T20:36:32+11:00
https://www.lavisducagou.nc/mon-profil/
2018-05-05T15:18:36+11:00
https://www.lavisducagou.nc/processus-de-paiement/
2018-05-07T15:23:39+11:00
https://www.lavisducagou.nc/paiement/
2018-05-07T23:44:51+11:00
https://www.lavisducagou.nc/historique-des-transactions/
2018-05-12T16:58:30+11:00
https://www.lavisducagou.nc/entreprise-standard/
2018-05-16T23:22:26+11:00
https://www.lavisducagou.nc/entreprise-premium/
2018-05-16T23:25:31+11:00
https://www.lavisducagou.nc/ajouter-une-entreprise/
2018-05-16T23:30:08+11:00
https://www.lavisducagou.nc/a-propos-de-nous/
2018-06-05T18:52:10+11:00
https://www.lavisducagou.nc/se-referencer/
2018-06-07T16:15:39+11:00
https://www.lavisducagou.nc/politique-de-confidentialite/
2018-06-15T09:15:11+11:00
https://www.lavisducagou.nc/donner-un-avis/
2018-06-16T10:55:24+11:00
https://www.lavisducagou.nc/conditions-dutilisation/
2018-06-19T16:39:44+11:00
https://www.lavisducagou.nc/annuaire-des-entreprises/
2018-06-19T20:51:22+11:00
https://www.lavisducagou.nc/pdf-generer/
2018-06-21T00:40:48+11:00
https://www.lavisducagou.nc/generer-pdf/
2018-06-21T00:51:22+11:00
https://www.lavisducagou.nc/contactez-nous/
2018-06-23T15:44:20+11:00
https://www.lavisducagou.nc/modifier-standard/
2018-06-23T20:04:01+11:00
https://www.lavisducagou.nc/pdf-generer-admin/
2018-06-30T19:19:01+11:00
https://www.lavisducagou.nc/conditions-generales-de-vente/
2018-07-02T15:19:51+11:00
https://www.lavisducagou.nc/modifier-standard-lentreprise/
2018-07-04T22:25:30+11:00
https://www.lavisducagou.nc/modifier-lentreprise/
2018-07-04T22:26:25+11:00
https://www.lavisducagou.nc/mentions-legales/
2018-07-27T16:08:01+11:00
https://www.lavisducagou.nc/jeu-concours/
2018-08-22T14:40:53+11:00
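The loop above calls getTextContent() directly on the result of item(0), which is null when a <url> entry has no <lastmod> (the sitemap protocol only requires <loc>). Below is a minimal null-safe sketch. The class name, the childText and describeUrls helpers, and the two-entry sitemap with example.org URLs are illustrative assumptions, and the XML is parsed from a string rather than from the live URL so that it runs offline:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class OptionalLastmodSketch {

    // Hypothetical helper: text of the first descendant named `tag`,
    // or `fallback` when the element is absent.
    public static String childText(Element parent, String tag, String fallback) {
        Node node = parent.getElementsByTagName(tag).item(0); // null if no match
        return node == null ? fallback : node.getTextContent().trim();
    }

    // Builds one "loc | lastmod" line per <url> element of the given sitemap XML.
    public static List<String> describeUrls(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList urlList = doc.getElementsByTagName("url");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < urlList.getLength(); i++) {
                Element url = (Element) urlList.item(i);
                lines.add(childText(url, "loc", "(no loc)") + " | "
                        + childText(url, "lastmod", "(no lastmod)"));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sitemap", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sitemap: the second entry deliberately has no <lastmod>.
        String xml = "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.org/</loc>"
                + "<lastmod>2018-07-14T11:30:25+11:00</lastmod></url>"
                + "<url><loc>https://example.org/about/</loc></url>"
                + "</urlset>";
        describeUrls(xml).forEach(System.out::println);
    }
}
```

Returning a fallback string keeps the loop simple; returning null or an Optional would work just as well if the caller needs to tell "missing" apart from real text.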
Upvotes: 1
Reputation: 10127
Actually, your XML document was parsed and loaded correctly.
You were only misled by the rather unhelpful output of doc.toString() (which is called behind the scenes when evaluating "XML = " + doc).
You already know the XML tag names to expect (urlset, url, loc, lastmod) and how they are nested within each other.
With this knowledge, you just need to walk the XML tree and extract the things you want. For example, like this:
import java.net.URL;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
public class Main {
    public static void main(String[] args) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
        // Get the <urlset> root element
        Element urlsetElement = doc.getDocumentElement();
        // Get the list of <url> elements within the <urlset> element
        NodeList urlNodeList = urlsetElement.getElementsByTagName("url");
        for (int i = 0; i < urlNodeList.getLength(); i++) {
            // Get the <url> element
            Element urlElement = (Element) urlNodeList.item(i);
            // Get the <loc> element within the <url> element
            Element locElement = (Element) urlElement.getElementsByTagName("loc").item(0);
            // Print the text content of the <loc> element
            System.out.println("loc = " + locElement.getTextContent());
            // Get the <lastmod> element within the <url> element
            Element lastmodElement = (Element) urlElement.getElementsByTagName("lastmod").item(0);
            // Print the text content of the <lastmod> element
            System.out.println("lastmod = " + lastmodElement.getTextContent());
        }
    }
}
You will get output like this:
loc = https://www.lavisducagou.nc/
lastmod = 2018-07-14T11:30:25+11:00
loc = https://www.lavisducagou.nc/sinscrire/
lastmod = 2018-05-03T16:58:35+11:00
loc = https://www.lavisducagou.nc/se-connecter/
lastmod = 2018-05-03T18:02:07+11:00
loc = https://www.lavisducagou.nc/mot-de-passe-oublie/
lastmod = 2018-05-03T20:33:08+11:00
loc = https://www.lavisducagou.nc/compte/
lastmod = 2018-05-03T20:36:32+11:00
...
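Since the factory is namespace-aware and the sitemap format uses http://www.sitemaps.org/schemas/sitemap/0.9 as its namespace, the same lookup can also be written against namespace URI and local name with getElementsByTagNameNS. That variant keeps working if a sitemap uses a prefix such as sm:loc, a case where a lookup by the plain tag name "loc" would not be expected to match. This is a small sketch; the class name, the locs helper, the sm prefix and the example.org URLs are illustrative assumptions, not part of the original code:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class SitemapNamespaceSketch {

    // Namespace URI of the sitemap protocol.
    static final String SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9";

    // Collects the text of every <loc> element that belongs to the sitemap namespace.
    public static List<String> locs(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for the *NS lookups to match
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList locNodes = doc.getElementsByTagNameNS(SITEMAP_NS, "loc");
            List<String> result = new ArrayList<>();
            for (int i = 0; i < locNodes.getLength(); i++) {
                result.add(locNodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse sitemap", e);
        }
    }

    public static void main(String[] args) {
        // Same content twice: once with a default namespace, once with a made-up "sm" prefix.
        String defaultNs = "<urlset xmlns=\"" + SITEMAP_NS + "\">"
                + "<url><loc>https://example.org/</loc></url></urlset>";
        String prefixed = "<sm:urlset xmlns:sm=\"" + SITEMAP_NS + "\">"
                + "<sm:url><sm:loc>https://example.org/</sm:loc></sm:url></sm:urlset>";
        System.out.println(locs(defaultNs));
        System.out.println(locs(prefixed));
    }
}
```

For the sitemap in the question, which uses a default namespace without prefixes, both lookup styles find the same elements, so this is a robustness choice rather than a required fix.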
Upvotes: 1
Reputation: 4113
What you are looking at is just the toString implementation of com.sun.org.apache.xerces.internal.dom.DocumentImpl:
public String toString() {
    return "["+getNodeName()+": "+getNodeValue()+"]";
}
Since a document node has no node value, getNodeValue() returns null, hence the output. What you need to do is get the child nodes, iterate over them, and extract the required details.
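To make the difference concrete, here is a minimal sketch that prints what toString() yields next to the actual markup, obtained by serializing the DOM with an identity Transformer from javax.xml.transform. The class name, the parse and serialize helpers, and the tiny inline XML with an example.org URL are assumptions made for illustration, and the XML is parsed from a string so that no network access is needed:

```java
import java.io.ByteArrayInputStream;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class DomToStringSketch {

    // Parses an XML string into a DOM Document (namespace-aware, as in the question).
    public static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Serializes the DOM back to markup using an identity transform.
    public static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize XML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<urlset><url><loc>https://example.org/</loc></url></urlset>");
        System.out.println("toString   = " + doc);                // node name and node value only
        System.out.println("node value = " + doc.getNodeValue()); // null for a document node
        System.out.println("root       = " + doc.getDocumentElement().getNodeName());
        System.out.println("serialized = " + serialize(doc));     // the actual markup
    }
}
```

With the JDK's bundled parser, the first line should match the [#document: null] output from the question, while the last line shows that the content is all there.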
I can't access the URL from Java because of a firewall issue, but here's a small excerpt from the file itself.
<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet type="text/xsl" href="//www.lavisducagou.nc/wp-content/plugins/wordpress-seo/css/main-sitemap.xsl"?>
<urlset xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:image="http://www.google.com/schemas/sitemap-image/1.1" xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9 http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd http://www.google.com/schemas/sitemap-image/1.1 http://www.google.com/schemas/sitemap-image/1.1/sitemap-image.xsd" xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://www.lavisducagou.nc/</loc>
<lastmod>2018-07-14T11:30:25+11:00</lastmod>
</url>
<url>
<loc>https://www.lavisducagou.nc/sinscrire/</loc>
<lastmod>2018-05-03T16:58:35+11:00</lastmod>
</url>
</urlset>
I've just extended your code with the next steps:
DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
org.w3c.dom.Document doc = factory.newDocumentBuilder().parse(new URL("https://www.lavisducagou.nc/page-sitemap.xml").openStream());
System.out.println("XML = " + doc);
NodeList nodeList = doc.getChildNodes();
for (int i = 0; i < nodeList.getLength(); i++) {
    System.out.println(nodeList.item(i).getNodeName());
}
Sample output:
XML = [#document: null]
xml-stylesheet
urlset
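The first name in the sample output is not an element: xml-stylesheet is the target of the processing instruction at the top of the file, which the DOM exposes as a sibling of the root element. Here is a short sketch that labels each top-level node by its type. The class name, the describeTopLevel helper, the shortened stylesheet href and the example.org URL are illustrative assumptions, and the XML is parsed from a string so that it runs without network access:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class TopLevelNodesSketch {

    // Returns "name (kind)" for each direct child of the document node.
    public static List<String> describeTopLevel(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getChildNodes();
            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Node node = nodes.item(i);
                String kind;
                if (node.getNodeType() == Node.ELEMENT_NODE) {
                    kind = "element";
                } else if (node.getNodeType() == Node.PROCESSING_INSTRUCTION_NODE) {
                    kind = "processing instruction";
                } else {
                    kind = "other";
                }
                result.add(node.getNodeName() + " (" + kind + ")");
            }
            return result;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Shortened stand-in for the excerpt above; the href is a placeholder.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<?xml-stylesheet type=\"text/xsl\" href=\"main-sitemap.xsl\"?>"
                + "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">"
                + "<url><loc>https://example.org/</loc></url></urlset>";
        describeTopLevel(xml).forEach(System.out::println);
    }
}
```

In practice, doc.getDocumentElement() skips straight to urlset, so iterating over doc.getChildNodes() is mainly useful for inspecting what the parser produced.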
Upvotes: 1