sqldoug

Reputation: 409

Split giant XML file into n-child versions

For example, the giant file has 50 million lines like this:

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<root>
  <activity>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
    <deliv>
      <subitem1>text</subitem1>
      <subitem2>text</subitem2>
    </deliv>
  </activity>
</root>

Each 'child' file would have the same structure but be about 5 million lines, or 1/10th of the original.

The reason is to make importing the data into a database (via SQL Server's OPENXML) more manageable, without exhausting memory.

Is XSLT the best choice here?

Upvotes: 0

Views: 201

Answers (3)

Martin Honnen

Reputation: 167471

The Enterprise Edition of Saxon 9.8 (Saxon 9.8 EE) supports the streaming feature of the XSLT 3.0 specification (about a year old at the time of writing). Streaming lets you use a subset of XSLT to read through an XML document in a forward-only way, using only the memory needed to store the currently visited node and its ancestors.

With that approach you can write code like for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size" to do positional grouping that reads through the file one deliv element at a time and collects the elements into groups of $size:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">

    <xsl:param name="size" as="xs:integer" select="1000"/>

    <xsl:mode on-no-match="shallow-copy" streamable="yes"/>

    <xsl:template match="root">
        <xsl:for-each-group select="activity/deliv" group-adjacent="(position() - 1) idiv $size">
            <xsl:result-document href="split-{format-number(current-grouping-key() + 1, '00000')}.xml" indent="yes">
                <root>
                    <activity>
                        <xsl:copy-of select="current-group()"/>
                    </activity>
                </root>
            </xsl:result-document>
        </xsl:for-each-group>
    </xsl:template>

</xsl:stylesheet>

That splits your input into a number of files, each containing $size deliv elements, except the last one, which holds the remaining deliv elements if fewer than $size are left.

Using Saxon EE requires a commercial license, but trial licenses are available.
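To see how the grouping key (position() - 1) idiv $size assigns elements to files, the arithmetic can be mirrored outside XSLT. The following is only an illustrative Python sketch, not part of the stylesheet; the function names grouping_keys and adjacent_groups are made up for this example:

```python
from itertools import groupby


def grouping_keys(count, size):
    """Mirror (position() - 1) idiv $size for positions 1..count."""
    return [(position - 1) // size for position in range(1, count + 1)]


def adjacent_groups(count, size):
    """Collect adjacent positions that share a key, like group-adjacent does."""
    positions = range(1, count + 1)
    return [list(group)
            for _, group in groupby(positions, key=lambda p: (p - 1) // size)]


print(grouping_keys(7, 3))    # [0, 0, 0, 1, 1, 1, 2]
print(adjacent_groups(7, 3))  # [[1, 2, 3], [4, 5, 6], [7]]
```

With seven elements and a size of 3, the keys 0, 1 and 2 correspond to the files split-00001.xml, split-00002.xml and split-00003.xml, because the stylesheet adds 1 to the key when building the file name.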

Upvotes: 3

zx485

Reputation: 29022

XSLT-2.0 and above is a good fit for this task.
XSLT-3.0 even supports streaming, which avoids holding the whole document in memory.

The following stylesheet splits an XML file into several files, with a configurable number of items per file, using xsl:result-document.

It takes two parameters:

  • split - the number of items in each split
  • doc - the name of the source document

This is the XSLT-2.0 template customized to your example (split.xslt):

<?xml version="1.0"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:param name="split" select="2" />        <!-- number of entries in each split -->
  <xsl:param name="doc" select="'src.xml'" />  <!-- name of source XML --> 

  <xsl:template match="/">  
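    <!-- index of the last chunk; (count - 1) idiv split avoids an empty trailing file when count is an exact multiple of split -->
    <xsl:variable name="cnt" select="(count(document($doc)/root/activity/deliv) - 1) idiv xs:integer($split)" />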
    <xsl:value-of select="concat('#',$cnt,'#')" />
    <xsl:for-each select="0 to $cnt">
        <xsl:variable name="cur" select="xs:integer(.)" /> 
        <xsl:result-document method="xml" href="output_no_{$cur}.xml" exclude-result-prefixes="xs">
            <root>
                <activity>
                    <xsl:for-each select="document($doc)/root/activity/deliv[position() gt (xs:integer($split) * $cur) and position() le (xs:integer($split) * ($cur + 1))]">
                        <xsl:copy-of select="."/>
                    </xsl:for-each>
                </activity>
            </root>
        </xsl:result-document>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet> 

With a current version of Saxon you can call it like this:

java -jar saxon9he.jar -s:src.xml -xsl:split.xslt doc=src.xml split=2

Upvotes: 2

Gregory Ramsey

Reputation: 46

XSLT could do this job. I'd recommend getting your hands on an XSLT 2.0 processor so that you can use xsl:result-document. Then you'd need a little bit of logic to decide where to split between your files. You could base this on the position() of the deliv elements, or try using xsl:for-each-group to form the groups that are sent to each file.
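The position-based splitting logic is not tied to XSLT. As a rough sketch of the same idea outside XSLT, here is a streaming splitter using only the Python standard library. It assumes the root/activity/deliv layout from the question, with a single activity element; the function names split_deliv and write_chunk and the split-NNNNN.xml file names are hypothetical choices for this example:

```python
import os
import xml.etree.ElementTree as ET


def write_chunk(delivs, path):
    """Wrap a batch of deliv elements in root/activity and write one file."""
    root = ET.Element("root")
    activity = ET.SubElement(root, "activity")
    activity.extend(delivs)
    ET.ElementTree(root).write(path, encoding="UTF-8", xml_declaration=True)


def split_deliv(source, size, out_dir="."):
    """Stream `source`, writing files of at most `size` deliv elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    paths, batch, activity = [], [], None

    def flush():
        path = os.path.join(out_dir, "split-%05d.xml" % (len(paths) + 1))
        write_chunk(batch, path)
        paths.append(path)
        batch.clear()
        if activity is not None:
            # Drop already-written children so memory use stays bounded.
            activity.clear()

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "activity":
            activity = elem              # remember the open parent element
        elif event == "end" and elem.tag == "deliv":
            batch.append(elem)           # deliv is complete at its end event
            if len(batch) == size:
                flush()
    if batch:                            # remaining elements, fewer than size
        flush()
    return paths
```

Calling split_deliv("src.xml", 1000) would write split-00001.xml, split-00002.xml, and so on into the current directory and return the list of paths. Whether this is preferable to an XSLT solution depends on your tooling; it is shown only to illustrate the split-by-position logic.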

Upvotes: 1
