Reputation: 409
For example, the giant file has 50 million lines like this:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <activity>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
  </activity>
</root>
Each 'child' file would have the same structure but only about 5 million lines, or 1/10th of the original.
The reason for this is to make the import into a database (via SQL Server's OPENXML) more manageable, without exhausting memory.
Is XSLT the best choice here?
Upvotes: 0
Views: 201
Reputation: 167471
The Enterprise Edition of Saxon 9.8 (Saxon 9.8 EE) supports the streaming feature of the one-year-old XSLT 3.0 specification, which allows you to use a subset of XSLT to read through an XML document in a forward-only way, using only the memory needed to store the currently visited node and its ancestors.
Using that approach you can write code like xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size" to do a positional grouping that reads through the file deliv element by deliv element and collects them into groups of $size:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
  <xsl:param name="size" as="xs:integer" select="1000"/>
  <xsl:mode on-no-match="shallow-copy" streamable="yes"/>
  <xsl:template match="root">
    <xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size">
      <xsl:result-document href="split-{format-number(current-grouping-key() + 1, '00000')}.xml" indent="yes">
        <root>
          <activity>
            <xsl:copy-of select="current-group()"/>
          </activity>
        </root>
      </xsl:result-document>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
That splits your input into a number of files, each containing $size deliv elements; the last file holds the remaining deliv elements if fewer than $size are left.
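To make the grouping key concrete, here is a small sketch of the same arithmetic in Python (not XSLT, purely as an illustration). It mirrors (position() - 1) idiv $size, which turns the 1-based position of each deliv into a zero-based group number, and the file name pattern from the stylesheet above. The function names group_key, file_name and group_sizes are made up for this example.

```python
from typing import Dict


def group_key(position: int, size: int) -> int:
    """Mirror of the XPath expression (position() - 1) idiv $size."""
    return (position - 1) // size


def file_name(key: int) -> str:
    """Mirror of split-{format-number(current-grouping-key() + 1, '00000')}.xml."""
    return f"split-{key + 1:05d}.xml"


def group_sizes(total: int, size: int) -> Dict[str, int]:
    """Count how many items end up in each output file."""
    sizes: Dict[str, int] = {}
    for position in range(1, total + 1):  # position() is 1-based
        name = file_name(group_key(position, size))
        sizes[name] = sizes.get(name, 0) + 1
    return sizes


if __name__ == "__main__":
    # 7 items in groups of 3: two full files plus one holding the remainder.
    print(group_sizes(7, 3))
    # {'split-00001.xml': 3, 'split-00002.xml': 3, 'split-00003.xml': 1}
```

Positions 1 to $size all map to key 0, positions $size + 1 to 2 * $size map to key 1, and so on, which is why adjacent deliv elements with the same key form one output file.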
Using Saxon EE requires a commercial license, but trial licenses are available.
Upvotes: 3
Reputation: 29022
XSLT-2.0 and above is a good fit for this task.
XSLT-3.0 even supports streaming.
The following stylesheet splits an XML file into a configurable number of files using xsl:result-document.
It takes two parameters:
- split - the number of items in each split
- doc - the name of the source document
This is the XSLT-2.0 template customized to your example (split.xslt):
<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:param name="split" select="2" /> <!-- number of entries in each split -->
  <xsl:param name="doc" select="'src.xml'" /> <!-- name of source XML -->
  <xsl:template match="/">
    <!-- number of output files: round up so a partial last group still gets a file,
         and an exact multiple does not produce an empty trailing file -->
    <xsl:variable name="cnt" select="xs:integer(ceiling(count(document($doc)/root/activity/deliv) div xs:integer($split)))" />
    <!-- report the file count as a message instead of writing it to the primary output -->
    <xsl:message select="concat('#',$cnt,'#')" />
    <xsl:for-each select="0 to $cnt - 1">
      <xsl:variable name="cur" select="xs:integer(.)" />
      <xsl:result-document method="xml" href="output_no_{$cur}.xml">
        <root>
          <activity>
            <!-- copy the deliv elements whose position falls into the current window -->
            <xsl:for-each select="document($doc)/root/activity/deliv[position() gt (xs:integer($split) * $cur) and position() le (xs:integer($split) * ($cur + 1))]">
              <xsl:copy-of select="."/>
            </xsl:for-each>
          </activity>
        </root>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
With a current version of Saxon you can call it like this:
java -jar saxon9he.jar -xsl:split.xslt src.xml doc=src.xml split=2
Upvotes: 2
Reputation: 46
XSLT could do this job. I'd recommend getting your hands on an XSLT v2.0 processor so that you can use xsl:result-document. Then you'd need a little bit of logic to decide when to split between your files. You could base this off the position() of the deliv elements, or try using xsl:for-each-group to make groups that are sent to each file.
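The position-based idea can be sketched outside XSLT too. The following Python sketch only illustrates the splitting logic and is not an XSLT solution: it reads the input forwards with the standard library's xml.etree.ElementTree.iterparse and starts a new output file after every size deliv elements. The function name split_deliv, the part-NNNNN.xml naming and the output directory are hypothetical, and the root/activity/deliv layout is assumed from the question.

```python
import os
import xml.etree.ElementTree as ET


def split_deliv(source, out_dir, size):
    """Write every `size` consecutive <deliv> elements to their own file.

    `source` is a file name or binary file object. Each output file repeats
    the <root><activity>...</activity></root> wrapper of the input.
    Returns the list of paths written.
    """
    written = []
    batch = []
    parent = None  # the <activity> element currently being parsed

    def flush():
        if not batch:
            return
        root = ET.Element("root")
        activity = ET.SubElement(root, "activity")
        activity.extend(batch)
        path = os.path.join(out_dir, f"part-{len(written) + 1:05d}.xml")
        ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)
        written.append(path)
        batch.clear()

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if elem.tag == "activity":
                parent = elem
        elif elem.tag == "deliv":
            batch.append(elem)
            if len(batch) == size:
                flush()
                if parent is not None:
                    # Drop the already written children so memory stays bounded.
                    parent.clear()
    flush()  # remaining elements, if fewer than `size` are left
    return written
```

Calling split_deliv("src.xml", "out", 1000) would, under these assumptions, write out/part-00001.xml, out/part-00002.xml and so on, with the last file holding whatever is left over.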
Upvotes: 1