How to parse a large XML document with DOM?

Question

I want to parse an XML document with the following characteristics:

it has no XML declaration
its elements can appear in no particular order

Output should be a CSV:

name,age,street,nr
Joe,34,test,12
Sam,24,...

Problem: when using event-driven parsers like StAX/SAX, I would have to create a temporary Employee bean, set its properties on each event, and later on convert the bean to CSV.

But as my XML file is several GB in size, I'd like to avoid creating an additional bean object for each entry.

So I probably have to use plain old DOM parsing? Correct me if I'm wrong; I'm happy for any suggestions.

I tried the following. The problem is that doc.getElementsByTagName("employees") returns an empty NodeList, while I'd expect one XML element. Why?

StringBuilder sb = new StringBuilder();

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(xml)));
doc.getDocumentElement().normalize();

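NodeList employeesNodes = doc.getElementsByTagName("employees");
for (int i = 0; i < employeesNodes.getLength(); i++) {
    Node employees = employeesNodes.item(i);
    if (employees.getNodeType() == Node.ELEMENT_NODE) {
        NodeList employeeNodes = ((Element) employees).getElementsByTagName("employee");
        for (int j = 0; j < employeeNodes.getLength(); j++) {
            Element employee = (Element) employeeNodes.item(j);
            // assumes each employee has exactly one <details> and one <address> element
            Element details = (Element) employee.getElementsByTagName("details").item(0);
            Element address = (Element) employee.getElementsByTagName("address").item(0);

            // name and age are read from <details>
            sb.append(details.getElementsByTagName("name").item(0).getTextContent()).append(',');
            sb.append(details.getElementsByTagName("age").item(0).getTextContent()).append(',');

            // street and nr are read from <address>; the last field ends the CSV row
            sb.append(address.getElementsByTagName("street").item(0).getTextContent()).append(',');
            sb.append(address.getElementsByTagName("nr").item(0).getTextContent()).append('\n');
        }
    }
}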
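
To narrow down why the lookup comes back empty, it may help to run the same logic against a tiny hand-written input first. The following is only a sketch: the class name DomCsvSketch, the helper text(), and the sample XML are made up for illustration, and the nesting (employees > employee > details/address) is assumed from the code above, not taken from the real file. Because getElementsByTagName searches all descendants, the sketch asks the document for "employee" directly and skips the outer loop.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomCsvSketch {
    // Hypothetical helper: text of the first descendant with this tag, or "" if absent.
    static String text(Element parent, String tag) {
        NodeList hits = parent.getElementsByTagName(tag);
        return hits.getLength() == 0 ? "" : hits.item(0).getTextContent();
    }

    // Builds the whole DOM in memory, so this is only suitable for small inputs.
    static String toCsv(String xml) throws Exception {
        DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new InputSource(new StringReader(xml)));
        StringBuilder sb = new StringBuilder();
        NodeList employees = doc.getElementsByTagName("employee");
        for (int i = 0; i < employees.getLength(); i++) {
            Element employee = (Element) employees.item(i);
            // Fields are looked up by name, so their order in the input does not matter.
            sb.append(text(employee, "name")).append(',')
              .append(text(employee, "age")).append(',')
              .append(text(employee, "street")).append(',')
              .append(text(employee, "nr")).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: no XML declaration, child elements in arbitrary order.
        String xml = "<employees><employee>"
                + "<address><nr>12</nr><street>test</street></address>"
                + "<details><age>34</age><name>Joe</name></details>"
                + "</employee></employees>";
        System.out.print(toCsv(xml));
    }
}
```

If this works on the sample but the real data still yields an empty NodeList, the difference is likely in the input itself (for example the actual element names), which is worth checking before changing the parsing approach.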

Michael Kay · Accepted Answer

A DOM solution is going to use a lot of memory; a SAX/StAX solution is going to involve writing and debugging a lot of code. The ideal tool for this job is an XSLT 3.0 streamable transformation:

NOTE

I originally wrote the select expression as copy-of(.)//(name, age, street, nr). This is incorrect, because the // operator sorts the results into document order, which we don't want. The use of ! and , carefully avoids the sorting.
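
Where an XSLT 3.0 streaming processor is not available, the StAX route from the question does not strictly need a bean per record. The following is a rough sketch, not a tested solution for the real file: the class name StaxCsvSketch and the sample input are invented for illustration, the four field names and the employee element come from the question, and the values are assumed to contain no commas or line breaks (no CSV escaping is done). It keeps one small array that is reused for every record and indexed by column, so the order of elements in the input does not affect the output order.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.Arrays;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxCsvSketch {
    // CSV columns in output order; element names taken from the question.
    private static final List<String> COLUMNS = Arrays.asList("name", "age", "street", "nr");

    static void writeCsv(Reader xml, Writer out) throws XMLStreamException, IOException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        String[] row = new String[COLUMNS.size()]; // single buffer, reused for every record
        Arrays.fill(row, "");
        out.write(String.join(",", COLUMNS));
        out.write('\n');
        while (reader.hasNext()) {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                int col = COLUMNS.indexOf(reader.getLocalName());
                if (col >= 0) {
                    // getElementText() reads the text and advances to the matching end tag
                    row[col] = reader.getElementText();
                }
            } else if (event == XMLStreamConstants.END_ELEMENT
                    && "employee".equals(reader.getLocalName())) {
                // Record complete: emit one CSV line, then clear the buffer.
                out.write(String.join(",", row));
                out.write('\n');
                Arrays.fill(row, "");
            }
        }
        reader.close();
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input: no XML declaration, child elements in arbitrary order.
        String xml = "<employees><employee>"
                + "<address><nr>12</nr><street>test</street></address>"
                + "<details><age>34</age><name>Joe</name></details>"
                + "</employee></employees>";
        StringWriter out = new StringWriter();
        writeCsv(new StringReader(xml), out);
        System.out.print(out);
    }
}
```

For a file of several GB, the Reader and Writer would be file-backed streams instead of the in-memory strings used in main; memory use then stays roughly constant regardless of the number of records.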
