Reputation: 86747
I want to parse an XML document with the following content:
<employees>
    <employee>
        <details>
            <name>Joe</name>
            <age>34</age>
        </details>
        <address>
            <street>test</street>
            <nr>12</nr>
        </address>
    </employee>
    <employee>
        <address>....</address>
        <details>
            <!-- note the changed order of elements! -->
            <age>24</age>
            <name>Sam</name>
        </details>
    </employee>
</employees>
Output should be a csv:
name;age;street;nr
Joe,34,test,12
Sam,24,...
Problem: when using event-driven parsers like StAX/SAX, I would have to create a temporary Employee bean whose properties I set on each event node, and later convert the bean to CSV.
But as my XML file is several GB in size, I'd like to avoid creating an additional bean object for each entry.
Thus I probably have to use plain old DOM parsing? Correct me if I'm wrong; I'm happy for any suggestions.
I tried as follows. The problem is that doc.getElementsByTagName("employees") returns an empty NodeList, while I'd expect one XML element. Why?
StringBuilder sb = new StringBuilder();
DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
DocumentBuilder db = dbf.newDocumentBuilder();
Document doc = db.parse(new InputSource(new StringReader(xml)));
doc.getDocumentElement().normalize();
NodeList employees = doc.getElementsByTagName("employees");
for (int i = 0; i < employees.getLength(); i++) {
    Node employeesNode = employees.item(i);
    if (employeesNode.getNodeType() == Node.ELEMENT_NODE) {
        NodeList employeeList = ((Element) employeesNode).getElementsByTagName("employee");
        for (int j = 0; j < employeeList.getLength(); j++) {
            Element employee = (Element) employeeList.item(j);
            Element details = (Element) employee.getElementsByTagName("details").item(0);
            Element address = (Element) employee.getElementsByTagName("address").item(0);
            // look up each field by name, so the element order inside <details> does not matter
            sb.append(details.getElementsByTagName("name").item(0).getTextContent()).append(',');
            sb.append(details.getElementsByTagName("age").item(0).getTextContent()).append(',');
            sb.append(address.getElementsByTagName("street").item(0).getTextContent()).append(',');
            sb.append(address.getElementsByTagName("nr").item(0).getTextContent()).append('\n');
        }
    }
}
Upvotes: 0
Views: 362
Reputation: 163322
A DOM solution is going to use a lot of memory; a SAX/StAX solution is going to involve writing and debugging a lot of code. The ideal tool for this job is an XSLT 3.0 streamable transformation:
<xsl:transform version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text"/>
    <xsl:mode streamable="yes" on-no-match="shallow-skip"/>
    <xsl:template match="employee">
        <xsl:value-of select="copy-of(.)!(.//name, .//age, .//street, .//nr)"
                      separator=","/>
        <xsl:text>&#10;</xsl:text>
    </xsl:template>
</xsl:transform>
NOTE
I originally wrote the select expression as copy-of(.)//(name, age, street, nr). This is incorrect, because the // operator sorts the results into document order, which we don't want. The use of ! and , carefully avoids the sorting.
Upvotes: 3
Reputation: 109557
Do not use a StringBuilder but write immediately to the file (Files.newBufferedWriter).
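As a minimal sketch of writing rows straight to the file: the class name CsvRowWriter, the method writeRows, and the Iterable<String[]> row source are hypothetical names chosen for illustration, and no CSV quoting is applied.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class CsvRowWriter {

    /** Writes each row as soon as it is handed over; only the writer's buffer stays in memory. */
    static void writeRows(Path target, Iterable<String[]> rows) throws IOException {
        try (BufferedWriter w = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                // Assumption: values contain no commas, quotes or line breaks.
                w.write(String.join(",", row));
                w.newLine();
            }
        }
    }
}
```

The try-with-resources block flushes and closes the writer even if a row fails halfway through.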
It is not a big deal to parse the XML manually, as there does not seem to be a high level of complexity, nor any need for XML-based validation.
Note that a literal & should be written as &amp; in XML. If the XML is valid (you could have a Reader that adds <?xml ...> in front), scanning through the XML would be:
XMLInputFactory f = XMLInputFactory.newInstance();
XMLStreamReader r = f.createXMLStreamReader( ... );
while (r.hasNext()) {
    r.next();
}
That easily allows maintaining a Map of employee attributes, started at <employee> and validated and written out at </employee>.
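The Map approach could be sketched roughly as follows. The class name EmployeeCsv, the method names, and the fixed column list are assumptions made for illustration; missing fields are written as empty values, and no CSV quoting is done.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical class for illustration; names are not from the question.
final class EmployeeCsv {

    // Assumed CSV column order, taken from the sample output in the question.
    private static final List<String> COLUMNS = Arrays.asList("name", "age", "street", "nr");

    /** Streams <employee> elements to CSV rows, reusing a single Map for all employees. */
    static void convert(Reader xml, Writer out) throws XMLStreamException, IOException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(xml);
        Map<String, String> row = new HashMap<>();
        StringBuilder text = new StringBuilder();
        while (r.hasNext()) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    text.setLength(0);                 // forget whitespace between elements
                    if (r.getLocalName().equals("employee")) {
                        row.clear();                   // a new employee starts
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    text.append(r.getText());          // text may arrive in several chunks
                    break;
                case XMLStreamConstants.END_ELEMENT: {
                    String name = r.getLocalName();
                    if (name.equals("employee")) {
                        writeRow(row, out);            // employee complete: emit one line
                    } else if (COLUMNS.contains(name)) {
                        row.put(name, text.toString().trim());
                    }
                    text.setLength(0);
                    break;
                }
                default:
                    break;
            }
        }
        r.close();
    }

    private static void writeRow(Map<String, String> row, Writer out) throws IOException {
        for (int i = 0; i < COLUMNS.size(); i++) {
            if (i > 0) {
                out.write(',');
            }
            out.write(row.getOrDefault(COLUMNS.get(i), ""));
        }
        out.write('\n');
    }
}
```

Because values are keyed by element name, the order of <name> and <age> inside <details> does not matter, and only one Map instance exists regardless of file size. Passing a writer from Files.newBufferedWriter as out keeps the output streaming as well.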
Upvotes: 1