dierre

Reputation: 7210

XPath approach for large XML files

The class below shows the classic approach to querying an XML document with XPath in Java:

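import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class Main {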

    private Document createXMLDocument(String fileName) throws Exception {
        DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
        domFactory.setNamespaceAware(true);
        DocumentBuilder builder = domFactory.newDocumentBuilder();
        Document doc = builder.parse(fileName);

        return doc;
    }

    private NodeList readXMLNodes(Document doc, String xpathExpression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        XPathExpression expr = xpath.compile(xpathExpression);

        Object result = expr.evaluate(doc, XPathConstants.NODESET);
        NodeList nodes = (NodeList) result;

        return nodes;
    }

    public static void main(String[] args) throws Exception {
        Main m = new Main();
        Document doc = m.createXMLDocument("tv.xml");
        NodeList nodes = m.readXMLNodes(doc, "//serie/eason/@id");
        int n = nodes.getLength();

        Map<Integer, List<String>> series = new HashMap<Integer, List<String>>();

        for (int i = 1; i <= n; i++) {
            nodes = m.readXMLNodes(doc, "//serie/eason[@id='" + i + "']/episode/text()");
            List<String> episodes = new ArrayList<String>();
            for (int j = 0; j < nodes.getLength(); j++) {
                episodes.add(nodes.item(j).getNodeValue());
            }
            series.put(i, episodes);
        }

        for (Map.Entry<Integer, List<String>> entry : series.entrySet()) {
            System.out.println("Season: " + entry.getKey());
            for (String ep : entry.getValue()) {
                System.out.println("Episode: " + ep);
            }
            System.out.println("+------------------------------------+");
        }
    }
}

Some of the calls in there worry me when the XML file is huge, such as

Document doc = builder.parse(fileName);

return doc;

or

  Object result = expr.evaluate(doc, XPathConstants.NODESET);
  NodeList nodes = (NodeList) result;

  return nodes;

I'm worried because the XML document I need to handle is created by the customer, and it can contain an essentially unbounded number of records describing emails and their contents (every user has their own personal email, so there is lots of HTML in there). I know it's not the smartest approach, but it's one of the possibilities, and it was already up and running before I arrived here.

My question is: how can I parse and evaluate huge XML files using XPath?

Upvotes: 1

Views: 4044

Answers (2)

Michael Kay

Reputation: 163262

First of all, XPath doesn't parse XML. Your createXMLDocument() method does that, producing as output a tree representation of the parsed XML. The XPath is then used to search the tree representation.

What you are really looking for is something that searches the XML on the fly, while it is being parsed.

One way to do this is with an XQuery system that implements "document projection" (for example, Saxon-EE). This will analyze your query to see what parts of the document are needed, and when you parse your document, it will build a tree containing only those parts of the document that are actually needed.

If the query is as simple as the one in your example, however, then it isn't too hard to code it as a SAX application, where events such as startElement and endElement are notified by the XML parser to the application, without building a tree in memory.
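As a rough illustration of the SAX route, here is a minimal sketch that gathers the same data as the question's XPath loop without building a tree. The class and method names (SaxEpisodeReader, EpisodeHandler, read) are made up for this example, and the element names are passed in as parameters because the real layout of tv.xml is not shown; the sketch simply assumes each season element carries an id attribute and contains episode elements with text.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of the question's code.
class SaxEpisodeReader {

    /** Collects episode text per season id while the parser streams through the input. */
    static class EpisodeHandler extends DefaultHandler {
        private final String seasonTag;
        private final String episodeTag;
        final Map<String, List<String>> series = new LinkedHashMap<>();
        private List<String> currentEpisodes; // non-null only while inside a season element
        private StringBuilder text;           // non-null only while inside an episode element

        EpisodeHandler(String seasonTag, String episodeTag) {
            this.seasonTag = seasonTag;
            this.episodeTag = episodeTag;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (localName.equals(seasonTag)) {
                // Assumption: the season id is an unprefixed "id" attribute.
                currentEpisodes = new ArrayList<>();
                series.put(atts.getValue("id"), currentEpisodes);
            } else if (localName.equals(episodeTag) && currentEpisodes != null) {
                text = new StringBuilder();
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // The parser may deliver one text node in several chunks, so append.
            if (text != null) {
                text.append(ch, start, length);
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (localName.equals(episodeTag) && text != null) {
                currentEpisodes.add(text.toString());
                text = null;
            } else if (localName.equals(seasonTag)) {
                currentEpisodes = null;
            }
        }
    }

    static Map<String, List<String>> read(InputStream in, String seasonTag, String episodeTag)
            throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed so that localName is populated
        SAXParser parser = factory.newSAXParser();
        EpisodeHandler handler = new EpisodeHandler(seasonTag, episodeTag);
        parser.parse(in, handler);
        return handler.series;
    }
}
```

With the element names from the question, the call would look like SaxEpisodeReader.read(new FileInputStream("tv.xml"), "eason", "episode"). Note that this sketch still keeps every result in a map; for a truly huge file you would more likely process each record in endElement and then discard it.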

Upvotes: 3

Chetter Hummin

Reputation: 6817

You could use the StAX parser. It will take less memory than the DOM options. A good introduction to StAX is at http://tutorials.jenkov.com/java-xml/stax.html
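To give an idea of what that looks like, here is a minimal sketch using the cursor API (XMLStreamReader) from javax.xml.stream. The class name StaxEpisodeReader and its read method are invented for this example, and the element names are parameters because the actual structure of tv.xml is not shown in the question; it assumes season elements with an id attribute that contain text-only episode elements.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of the question's code.
class StaxEpisodeReader {

    static Map<String, List<String>> read(InputStream in, String seasonTag, String episodeTag)
            throws Exception {
        Map<String, List<String>> series = new LinkedHashMap<>();
        List<String> currentEpisodes = null; // non-null only while inside a season element

        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        try {
            // Pull events one at a time; only the current event is held by the reader.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if (name.equals(seasonTag)) {
                        // Assumption: the season id is an "id" attribute (any namespace).
                        currentEpisodes = new ArrayList<>();
                        series.put(reader.getAttributeValue(null, "id"), currentEpisodes);
                    } else if (name.equals(episodeTag) && currentEpisodes != null) {
                        // getElementText() reads up to the matching end tag and
                        // throws if the element contains child elements.
                        currentEpisodes.add(reader.getElementText());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && reader.getLocalName().equals(seasonTag)) {
                    currentEpisodes = null;
                }
            }
        } finally {
            reader.close();
        }
        return series;
    }
}
```

With the element names from the question, the call would be StaxEpisodeReader.read(new FileInputStream("tv.xml"), "eason", "episode"). The sketch collects everything into a map only to mirror the question's output; with a very large file you would probably handle each record as soon as it has been read instead.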

Upvotes: 3
