Reputation: 700
I'm looking for the best way to dynamically modify the tags of a very large XML file.
Consider the following input XML:
Input
<?xml version="1.0" encoding="UTF-8"?>
<rootTag>
  <dictionary>
    <name>field1</name>
    <address>field2</address>
    <gender>field3</gender>
    .
    .
    <postcode>field30</postcode>
  </dictionary>
  <records>
    <record>
      <field id="field1">John</field>
      <field id="field2">Svalbard</field>
      <field id="field3">M</field>
      .
      .
      <field id="field30">12345</field>
    </record>
    .
    .
    <record>
      .
      .
    </record>
  </records>
</rootTag>
The XML file contains a dictionary at the top, followed by a huge number of record nodes whose field ids refer to the dictionary entries.
I'd like to replace the tag of each field within a record node with the corresponding tag name from the dictionary. Thus, the output should look like:
Output
<?xml version="1.0" encoding="UTF-8"?>
<rootTag>
  <records>
    <record>
      <name>John</name>
      <address>Svalbard</address>
      <gender>M</gender>
      .
      .
      <postcode>12345</postcode>
    </record>
    .
    .
    <record>
      .
      .
    </record>
  </records>
</rootTag>
Keeping in mind that there are a tremendously large number of <record>
nodes, what's the best way to achieve this transformation in Java?
Note that I only want to change the tags and not the attributes.
Upvotes: 2
Views: 142
Reputation: 500
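Why not parse the XML manually? The test below reads the dictionary with DOM and then rewrites the field lines with a regular expression, assuming every field element sits on its own line.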
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.junit.Assert;
import org.junit.Test;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class ReplaceTextInXmlTest {

    @Test
    public void test() throws Exception {
        final String inputXml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<rootTag>\n" +
            "  <dictionary>\n" +
            "    <name>field1</name>\n" +
            "    <address>field2</address>\n" +
            "    <gender>field3</gender>\n" +
            "  </dictionary>\n" +
            "  <records>\n" +
            "    <record>\n" +
            "      <field id=\"field1\">John</field>\n" +
            "      <field id=\"field2\">Svalbard</field>\n" +
            "      <field id=\"field3\">M</field>\n" +
            "    </record>\n" +
            "    <record>\n" +
            "      <field id=\"field1\">Fritz</field>\n" +
            "      <field id=\"field2\">Hamburg</field>\n" +
            "      <field id=\"field3\">M</field>\n" +
            "    </record>\n" +
            "  </records>\n" +
            "</rootTag>";

        // Step 1: parse only the (small) <dictionary> element with DOM
        // and build a map from field number to tag name.
        final Map<Integer, String> mapping = new HashMap<>();
        final String closingTag = "</dictionary>";
        final int start = inputXml.indexOf("<dictionary>");
        final int end = inputXml.indexOf(closingTag, start) + closingTag.length();
        final DocumentBuilder db = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        final Document dom;
        try (ByteArrayInputStream is = new ByteArrayInputStream(
                inputXml.substring(start, end).getBytes(StandardCharsets.UTF_8))) {
            dom = db.parse(is);
        }
        final Element root = dom.getDocumentElement();
        final NodeList nodes = root.getChildNodes();
        for (int i = 0, z = nodes.getLength(); i < z; ++i) {
            final Node node = nodes.item(i);
            if (node.getNodeType() == Node.ELEMENT_NODE) {
                final String name = node.getNodeName();
                final String value = node.getTextContent();
                // "field12" -> 12
                mapping.put(Integer.parseInt(value.substring("field".length())), name);
            }
        }

        // Step 2: rewrite the input line by line. A line is rewritten only if it
        // consists of exactly one <field id="fieldN">text</field> element.
        final Pattern fieldPattern =
            Pattern.compile("^(\\s*<)field id=\"field([0-9]+)\"(>[^<]*</)field(>\\s*)$");
        final StringBuilder outputXml = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(new StringReader(inputXml))) {
            String line;
            while ((line = reader.readLine()) != null) {
                final Matcher match = fieldPattern.matcher(line);
                if (match.matches()) {
                    final int fieldId = Integer.parseInt(match.group(2));
                    final String tagName = mapping.get(fieldId);
                    outputXml.append(match.group(1));
                    outputXml.append(tagName);
                    outputXml.append(match.group(3));
                    outputXml.append(tagName);
                    outputXml.append(match.group(4));
                } else {
                    outputXml.append(line);
                }
                outputXml.append('\n');
            }
        }

        // Note: this approach keeps the <dictionary> element in the output.
        final String expectedXml =
            "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
            "<rootTag>\n" +
            "  <dictionary>\n" +
            "    <name>field1</name>\n" +
            "    <address>field2</address>\n" +
            "    <gender>field3</gender>\n" +
            "  </dictionary>\n" +
            "  <records>\n" +
            "    <record>\n" +
            "      <name>John</name>\n" +
            "      <address>Svalbard</address>\n" +
            "      <gender>M</gender>\n" +
            "    </record>\n" +
            "    <record>\n" +
            "      <name>Fritz</name>\n" +
            "      <address>Hamburg</address>\n" +
            "      <gender>M</gender>\n" +
            "    </record>\n" +
            "  </records>\n" +
            "</rootTag>\n";
        Assert.assertEquals(expectedXml, outputXml.toString());
    }
}
Upvotes: 0
Reputation: 3428
I agree with @PeterJaloveczki that XSLT could be the way to go. The following stylesheet should do the job:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:fn="http://www.w3.org/2005/xpath-functions">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>
  <xsl:template match="dictionary" />
  <xsl:template match="field">
    <xsl:variable name="id" select="@id" />
    <xsl:variable name="tagName" select="/rootTag/dictionary/node()[. = $id]/name()" />
    <xsl:element name="{if ($tagName != '') then $tagName else 'field'}">
      <xsl:apply-templates select="node() | @*[name() != 'id']" />
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>
It is simplified in some places because the XML examples are simplified too, but it should basically work.
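For completeness, here is a minimal sketch of how such a stylesheet could be applied from Java with the standard `javax.xml.transform` API. It rests on several assumptions: the JDK's built-in transformer only supports XSLT 1.0, so the embedded stylesheet is an XSLT 1.0 adaptation of the idea rather than the stylesheet above (which would need an XSLT 2.0 processor such as Saxon); the class name `XsltTagRenamer` and the constant `STYLESHEET` are made up for illustration; and fields without a dictionary entry are copied unchanged. Keep in mind that an XSLT processor typically builds the whole input tree in memory, which may be a problem for a very large file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

final class XsltTagRenamer {

    // Hypothetical XSLT 1.0 adaptation: identity copy, drop <dictionary>,
    // rename <field> elements whose id has a dictionary entry.
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:template match=\"node()|@*\">"
        + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
        + "</xsl:template>"
        + "<xsl:template match=\"dictionary\"/>"
        + "<xsl:template match=\"field\">"
        + "<xsl:variable name=\"tag\" select=\"name(/rootTag/dictionary/*[. = current()/@id])\"/>"
        + "<xsl:choose>"
        + "<xsl:when test=\"$tag != ''\">"
        + "<xsl:element name=\"{$tag}\">"
        + "<xsl:apply-templates select=\"node()|@*[name() != 'id']\"/>"
        + "</xsl:element>"
        + "</xsl:when>"
        + "<xsl:otherwise>"
        + "<xsl:copy><xsl:apply-templates select=\"node()|@*\"/></xsl:copy>"
        + "</xsl:otherwise>"
        + "</xsl:choose>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies STYLESHEET to the given XML text and returns the result as a string. */
    static String transform(String xml) throws TransformerException {
        Transformer transformer = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
        StringWriter out = new StringWriter();
        transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

`StreamSource` and `StreamResult` also accept files and streams, so the same call works on files; strings are used here only to keep the sketch short.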
Upvotes: 1
Reputation: 358
A SAX parser is the way to go, since it parses the XML as a stream of events instead of loading the whole document into memory at once. See this for details: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
Upvotes: 0
Reputation: 136162
One option is to use StAX. It performs well, it processes the XML as a stream without loading the whole document into memory, and it is convenient to use.
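A minimal sketch of what a StAX-based version might look like, using `XMLStreamReader` and `XMLStreamWriter` from `javax.xml.stream`. The class name `StaxTagRenamer` and its helper methods are hypothetical. The sketch assumes the dictionary appears before the records (as in the question), that the document uses no namespaces, and that comments and processing instructions can be dropped. Fields whose id has no dictionary entry are written unchanged.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import javax.xml.stream.XMLStreamWriter;

final class StaxTagRenamer {

    /** Copies the XML from in to out, dropping dictionary and renaming mapped field elements. */
    static void transform(InputStream in, OutputStream out) throws XMLStreamException {
        XMLInputFactory inputFactory = XMLInputFactory.newInstance();
        inputFactory.setProperty(XMLInputFactory.SUPPORT_DTD, false); // assumption: no DTD needed
        XMLStreamReader reader = inputFactory.createXMLStreamReader(in);
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");

        Map<String, String> idToTag = new HashMap<>(); // e.g. "field1" -> "name"
        writer.writeStartDocument("UTF-8", "1.0");
        while (reader.hasNext()) {
            switch (reader.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if ("dictionary".equals(reader.getLocalName())) {
                        readDictionary(reader, idToTag); // consumed, not written
                    } else {
                        writeStartTag(reader, writer, idToTag);
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                case XMLStreamConstants.SPACE:
                    writer.writeCharacters(reader.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    writer.writeEndElement(); // closes whichever element is currently open
                    break;
                case XMLStreamConstants.END_DOCUMENT:
                    writer.writeEndDocument();
                    break;
                default:
                    break; // comments and processing instructions are dropped in this sketch
            }
        }
        writer.flush();
        writer.close();
        reader.close();
    }

    /** Reader is on the dictionary start tag; reads entries up to its matching end tag. */
    private static void readDictionary(XMLStreamReader reader, Map<String, String> idToTag)
            throws XMLStreamException {
        while (reader.next() != XMLStreamConstants.END_ELEMENT) {
            if (reader.isStartElement()) {
                String tag = reader.getLocalName();
                // getElementText() leaves the reader on the entry's own end tag
                idToTag.put(reader.getElementText().trim(), tag);
            }
        }
    }

    /** Writes the current start tag, renaming a field element if its id is in the dictionary. */
    private static void writeStartTag(XMLStreamReader reader, XMLStreamWriter writer,
                                      Map<String, String> idToTag) throws XMLStreamException {
        String original = reader.getLocalName();
        String mapped = "field".equals(original)
            ? idToTag.get(reader.getAttributeValue(null, "id"))
            : null;
        writer.writeStartElement(mapped != null ? mapped : original);
        for (int i = 0; i < reader.getAttributeCount(); i++) {
            String attr = reader.getAttributeLocalName(i);
            if (mapped != null && "id".equals(attr)) {
                continue; // the id is now expressed by the tag name
            }
            writer.writeAttribute(attr, reader.getAttributeValue(i));
        }
    }
}
```

Called with a `FileInputStream` and a `FileOutputStream` (ideally buffered), this only ever holds the dictionary map and the current event in memory, regardless of how many record nodes the file contains.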
Upvotes: 0
Reputation: 1236
I would probably go with a SAX XML parser, which makes sure you do not load the whole DOM tree into memory at once.
In short, you would first populate a dictionary and then, as you parse each field tag one by one, replace its name with the tag name the dictionary holds for its id.
An example of how to approach SAX parsing in Java: http://docs.oracle.com/javase/tutorial/jaxp/sax/parsing.html
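The two steps described above could be sketched as a SAX filter. This is only an illustration under some assumptions: the class name `DictionaryRenameFilter` is made up, the dictionary is flat and precedes the records, the document uses no namespaces, and an identity `Transformer` is used merely as a convenient serializer for the filtered SAX events.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

final class DictionaryRenameFilter extends XMLFilterImpl {
    private final Map<String, String> idToTag = new HashMap<>(); // "field1" -> "name"
    private final Deque<String> openTags = new ArrayDeque<>();   // names as written to the output
    private final StringBuilder entryText = new StringBuilder();
    private boolean inDictionary;
    private boolean inEntry;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (inDictionary) {               // step 1: a dictionary entry starts
            inEntry = true;
            entryText.setLength(0);
            return;
        }
        if ("dictionary".equals(qName)) { // swallow the dictionary itself
            inDictionary = true;
            return;
        }
        String mapped = "field".equals(qName) ? idToTag.get(atts.getValue("id")) : null;
        if (mapped == null) {             // not a mapped field: pass through unchanged
            openTags.push(qName);
            super.startElement(uri, localName, qName, atts);
            return;
        }
        AttributesImpl kept = new AttributesImpl(atts); // step 2: rename and drop the id
        kept.removeAttribute(kept.getIndex("id"));
        openTags.push(mapped);
        super.startElement("", mapped, mapped, kept);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (inDictionary) {
            if (inEntry) {                // entry closed: text is the id, qName is the tag
                idToTag.put(entryText.toString().trim(), qName);
                inEntry = false;
            } else {
                inDictionary = false;     // the dictionary element itself closed
            }
            return;
        }
        String tag = openTags.pop();
        if (tag.equals(qName)) {
            super.endElement(uri, localName, qName);
        } else {
            super.endElement("", tag, tag);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (inDictionary) {
            if (inEntry) {
                entryText.append(ch, start, length);
            }
            return;
        }
        super.characters(ch, start, length);
    }

    /** Parses in through the filter and serializes the filtered events to out. */
    static void transform(InputStream in, OutputStream out) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        DictionaryRenameFilter filter = new DictionaryRenameFilter();
        filter.setParent(factory.newSAXParser().getXMLReader());
        Transformer serializer = TransformerFactory.newInstance().newTransformer(); // identity
        serializer.transform(new SAXSource(filter, new InputSource(in)), new StreamResult(out));
    }
}
```

The filter keeps only the dictionary map and a stack of currently open tag names, so its own memory use does not grow with the number of record nodes.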
Upvotes: 0