Reputation: 7210
The class below shows the classic approach to reading an XML document and querying it via XPath in Java:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class Main {
    // Parses the whole file into an in-memory DOM tree.
    private Document createXMLDocument(String fileName) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse(fileName);
        return doc;
    }

    // Evaluates an XPath expression against the DOM tree and returns the matching nodes.
    private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile(xpathExpression);
        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;
        return nodes;
    }

    public static void main(String[] args) throws Exception {
        Main m = new Main();
        Document doc = m.createXMLDocument("tv.xml");
        NodeList nodes = m.readXMLNodes(doc, "//serie/eason/@id");
        int n = nodes.getLength();
        Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();
        // Assumes the ids run from 1 to n; one XPath query per id.
        for (int i = 1; i <= n; i++) {
            nodes = m.readXMLNodes(doc, "//serie/eason[@id='" + i + "']/episode/text()");
            List<String> episodes = new ArrayList<String>();
            for (int j = 0; j < nodes.getLength(); j++) {
                episodes.add(nodes.item(j).getNodeValue());
            }
            series.put(i, episodes);
        }
        for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
            System.out.println("Season: " + entry.getKey());
            for (String ep : entry.getValue()) {
                System.out.println("Episode: " + ep);
            }
            System.out.println("+------------------------------------+");
        }
    }
}
Some of the calls in there worry me when the XML file is huge, such as
Document doc = builder.parse(fileName);
return doc;
or
Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
return nodes;
I'm worried because the XML document I need to handle is created by the customer, and it can contain an unbounded number of records describing emails and their contents (every user has their own personal email, so there is a lot of HTML in there). I know it's not the smartest approach, but it's one of the possibilities and it was already up and running before I arrived here.
My question is: how can I parse and evaluate huge XML files using XPath?
Upvotes: 1
Views: 4044
Reputation: 163262
First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.
What you are really looking for is something that searches the XML on the fly, while it is being parsed.
One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.
If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where the XML parser notifies the application of events such as startElement and endElement as it reads the file, without building a tree in memory.
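A minimal sketch of what such a SAX application could look like for the query in the question. It assumes the file has the shape implied by the XPath expressions (serie > eason with an id attribute > episode with plain text content), which the question never shows. The names SaxEpisodes, EpisodeHandler and readEpisodes are made up for illustration. Unlike the XPath, it does not check that eason sits directly under serie, and it keys the map by the raw id string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class SaxEpisodes {

    // Collects episode text per id while the parser streams through the input.
    static class EpisodeHandler extends DefaultHandler {
        final Map<String, List<String>> series = new LinkedHashMap<>();
        private List<String> currentSeason; // null while outside an <eason> element
        private StringBuilder text;         // non-null only while inside an <episode>

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("eason".equals(qName)) {
                currentSeason = series.computeIfAbsent(atts.getValue("id"), k -> new ArrayList<>());
            } else if ("episode".equals(qName) && currentSeason != null) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text run in several chunks, so append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("episode".equals(qName) && text != null) {
                currentSeason.add(text.toString());
                text = null;
            } else if ("eason".equals(qName)) {
                currentSeason = null;
            }
        }
    }

    // Streams the input once; only the collected strings stay in memory.
    static Map<String, List<String>> readEpisodes(InputStream in) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        EpisodeHandler handler = new EpisodeHandler();
        parser.parse(in, handler);
        return handler.series;
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample (assumed structure) instead of the real tv.xml.
        String sample = "<serie><eason id=\"1\"><episode>Pilot</episode></eason></serie>";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(readEpisodes(in));
    }
}
```

For the real file you would pass a FileInputStream for tv.xml instead of the inline sample. Memory use then depends on how much you choose to collect, not on the size of the document.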
Upvotes: 3
Reputation: 6817
You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
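As a rough sketch, here is how the query from the question might look with the StAX cursor API (XMLStreamReader). It assumes the structure implied by the question's XPath (eason elements with an id attribute containing text-only episode elements); the names StaxEpisodes and readEpisodes are hypothetical. Note that getElementText() throws if an episode contains child elements, so this only fits text-only content.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

class StaxEpisodes {

    // Pulls events one at a time; no tree of the whole document is built.
    static Map<String, List<String>> readEpisodes(Reader in) throws Exception {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Map<String, List<String>> series = new LinkedHashMap<>();
        List<String> currentSeason = null; // null while outside an <eason> element
        try {
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("eason".equals(name)) {
                        String id = reader.getAttributeValue(null, "id");
                        currentSeason = series.computeIfAbsent(id, k -> new ArrayList<>());
                    } else if ("episode".equals(name) && currentSeason != null) {
                        // Reads the text and leaves the cursor on </episode>.
                        currentSeason.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "eason".equals(reader.getLocalName())) {
                    currentSeason = null;
                }
            }
        } finally {
            reader.close();
        }
        return series;
    }

    public static void main(String[] args) throws Exception {
        // Tiny inline sample (assumed structure) instead of the real tv.xml.
        String sample = "<serie><eason id=\"1\"><episode>Pilot</episode></eason></serie>";
        System.out.println(readEpisodes(new StringReader(sample)));
    }
}
```

The difference from SAX is mostly one of control: here your code asks for the next event in a loop, rather than the parser calling back into a handler.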
Upvotes: 3