Swapper

Reputation: 83

Fastest way to split a large XML file per tag

I would like to split a very large XML file (20 GB) into smaller XML files using Python, based on the first-level tag name.

This is the source XML file structure.

<?xml version="1.0" encoding="UTF-8"?>
<objects>
     <Client id="1">
          <Name>"John"</Name>
          ...
     </Client>
     <Client id="2">
          <Name>"Bob"</Name>
          ...
     </Client>
     <ClientAdditionnalInfo id="1">
          <Address> </Address>
          <Number>  </Number>
     </ClientAdditionnalInfo>
     <ClientAdditionnalInfo id="2">
          <Address> </Address>
          <Number>  </Number>
     </ClientAdditionnalInfo>
     ...
     ...
     ...
     <ClientInvoices>
          <InvoiceNumber>text</InvoiceNumber>
          <InvoiceDate>text</InvoiceDate>
     </ClientInvoices>
     ...
     <Client id="3">
          <Name>"Jenny"</Name>
          ...
     </Client>
          ...
          ...

I would like to get one XML file per distinct first-level tag (Client.xml, ClientAdditionnalInfo.xml, ClientInvoices.xml, ...).

Client.xml should look like this:

<?xml version="1.0" encoding="UTF-8"?>
     <Client id="1">
          <Name>"John"</Name>
          ...
     </Client>
     <Client id="2">
          <Name>"Bob"</Name>
          ...
     </Client>

The file has more than 525 million lines and I don't have a list of the tags.

This is my Python code, but it creates a file for every tag at every nesting level, and it overwrites existing files:

import xml.etree.ElementTree as ET

tree = ET.iterparse('filename.xml', events=('end', ))

for event, elem in tree:
    filename = format(elem.tag + ".xml")
    with open(filename, 'wb') as f:
        f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
        f.write(ET.tostring(elem))

What would be the most efficient way to split the large XML file?

Upvotes: 0

Views: 855

Answers (1)

Michael Kay

Reputation: 163595

Assuming the groups of top-level elements are always adjacent, and not interleaved, the following XSLT 3.0 streaming transformation will do the job:

<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:mode streamable="yes"/>

<xsl:template match="objects">
  <xsl:for-each-group select="*" group-adjacent="local-name()">
     <xsl:result-document href="{current-grouping-key()}.xml">
       <xsl:copy-of select="current-group()"/>
     </xsl:result-document>
  </xsl:for-each-group>
</xsl:template>

</xsl:stylesheet>

Disclaimer: in practice the only streaming XSLT 3.0 engine currently available is Saxon-EE, which is my company's product, and is paid-for software.

You have asked for output files that are not well-formed documents because they contain no outermost wrapper element, and that's what my code will deliver. The output will probably be more useful if you add a wrapper element.

It's likely to run at about 1 minute per gigabyte, depending on your hardware.
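
Since the question asks for Python, here is a minimal sketch of the same split using only the standard library's xml.etree.ElementTree.iterparse. It is not the XSLT approach above and makes no claim about speed. It tracks nesting depth so that only first-level elements are written, keeps one output file open per tag (so nothing is overwritten and interleaved groups, such as the Client id="3" that follows ClientInvoices in the sample, still land in the right file), and clears the root after each element to keep memory flat. The function name split_by_top_level_tag, the out_dir parameter and the "objects" wrapper element are assumptions for illustration; the sketch also assumes the tags carry no namespace, so that each tag name is usable as a file name.

```python
import os
import xml.etree.ElementTree as ET


def split_by_top_level_tag(source, out_dir=".", wrapper="objects"):
    """Stream `source` and append each first-level element to <tag>.xml.

    Hypothetical helper: `wrapper` is the element that encloses the
    records in every output file so that each one is well-formed.
    """
    handles = {}  # tag name -> open output file
    depth = 0
    root = None
    try:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem  # remember the outermost element
                continue
            depth -= 1
            if depth != 1:
                continue  # nested element (or the root itself) closed
            # A first-level child has just been fully parsed.
            out = handles.get(elem.tag)
            if out is None:
                path = os.path.join(out_dir, elem.tag + ".xml")
                out = open(path, "w", encoding="utf-8")
                out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
                out.write(f"<{wrapper}>\n")
                handles[elem.tag] = out
            elem.tail = None  # drop the whitespace that followed the element
            out.write(ET.tostring(elem, encoding="unicode"))
            out.write("\n")
            root.clear()  # detach processed children so memory stays flat
    finally:
        # Close the wrapper element and the file for every tag seen.
        for out in handles.values():
            out.write(f"</{wrapper}>\n")
            out.close()
    return sorted(handles)
```

Calling split_by_top_level_tag("filename.xml") would write Client.xml, ClientAdditionnalInfo.xml and so on into the current directory. The number of simultaneously open files equals the number of distinct first-level tags, which should be fine for a handful of tags but would need rethinking if there were thousands.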

Upvotes: 2
