Reputation: 83
I would like to split a very large XML file (20 GB) into smaller XML files using Python, based on the first-level tag name.
This is the source XML file structure.
<?xml version="1.0" encoding="UTF-8"?>
<objects>
    <Client id="1">
        <Name>"John"</Name>
        ...
    </Client>
    <Client id="2">
        <Name>"Bob"</Name>
        ...
    </Client>
    <ClientAdditionnalInfo id="1">
        <Address> </Address>
        <Number> </Number>
    </ClientAdditionnalInfo>
    <ClientAdditionnalInfo id="2">
        <Address> </Address>
        <Number> </Number>
    </ClientAdditionnalInfo>
    ...
    ...
    ...
    <ClientInvoices>
        <InvoiceNumber>text</InvoiceNumber>
        <InvoiceDate>text</InvoiceDate>
    </ClientInvoices>
    ...
    <Client id="3">
        <Name>"Jenny"</Name>
        ...
    </Client>
    ...
    ...
I would like to get one XML file per distinct first-level tag (Client.xml, ClientAdditionnalInfo.xml, ClientInvoices.xml, ...).
Client.xml should look like this:
<?xml version="1.0" encoding="UTF-8"?>
<Client id="1">
    <Name>"John"</Name>
    ...
</Client>
<Client id="2">
    <Name>"Bob"</Name>
    ...
</Client>
The file has more than 525 million lines and I don't have a list of the tags.
This is my Python code, but it creates a file for every tag at every nesting level, and it overwrites existing files:
import xml.etree.ElementTree as ET

tree = ET.iterparse('filename.xml', events=('end', ))
for event, elem in tree:
    filename = format(elem.tag + ".xml")
    with open(filename, 'wb') as f:
        f.write(b"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n")
        f.write(ET.tostring(elem))
What would be the most efficient way to split the large XML file?
Upvotes: 0
Views: 855
Reputation: 163595
Assuming the groups of top-level elements are always adjacent, and not interleaved, the following XSLT 3.0 streaming transformation will do the job:
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:mode streamable="yes"/>
    <xsl:template match="objects">
        <xsl:for-each-group select="*" group-adjacent="local-name()">
            <xsl:result-document href="{current-grouping-key()}.xml">
                <xsl:copy-of select="current-group()"/>
            </xsl:result-document>
        </xsl:for-each-group>
    </xsl:template>
</xsl:stylesheet>
Disclaimer: in practice the only streaming XSLT 3.0 engine currently available is Saxon-EE, which is my company's product, and is paid-for software.
You have asked for output files that are not well-formed documents because they contain no outermost wrapper element, and that's what my code will deliver. The output will probably be more useful if you add a wrapper element.
It's likely to run at about 1 minute per gigabyte, depending on your hardware.
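If you need to stay in Python, a rough sketch of the same idea using the standard library's `ElementTree.iterparse` might look like the following. It is not a tested solution for a 20 GB input, and it rests on assumptions that are not in the question: the element names carry no namespace (so they are safe to use as file names), the number of distinct first-level tags is small enough to keep one open file per tag, and the function name `split_by_top_level_tag` and its `out_dir` and `wrapper` parameters are made up for illustration. It tracks nesting depth so that only first-level elements are written, opens each output file once so nothing is overwritten, and clears the root after each element to keep memory roughly flat.

```python
import os
import xml.etree.ElementTree as ET


def split_by_top_level_tag(source, out_dir=".", wrapper=None):
    """Write each first-level element of `source` to <out_dir>/<tag>.xml.

    Hypothetical helper: `wrapper`, if given, is an element name written
    around the content of every output file so the result is well-formed.
    Returns the sorted list of first-level tag names that were seen.
    """
    handles = {}  # tag name -> open binary file, opened once per tag
    depth = 0     # current nesting depth; the root element is depth 1
    root = None
    try:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                depth += 1
                if depth == 1:
                    root = elem
                continue

            # "end" event: leave the element we were in.
            depth -= 1
            if depth != 1:
                continue  # not a first-level element (or the root itself)

            out = handles.get(elem.tag)
            if out is None:
                # Assumes the tag has no namespace, so it is a valid file name.
                path = os.path.join(out_dir, elem.tag + ".xml")
                out = open(path, "wb")
                out.write(b'<?xml version="1.0" encoding="UTF-8"?>\n')
                if wrapper:
                    out.write(("<%s>\n" % wrapper).encode("utf-8"))
                handles[elem.tag] = out

            elem.tail = None  # drop whitespace that followed the element
            out.write(ET.tostring(elem, encoding="utf-8"))
            out.write(b"\n")

            # Release the finished element so memory does not grow with input.
            root.clear()
    finally:
        for out in handles.values():
            if wrapper:
                out.write(("</%s>\n" % wrapper).encode("utf-8"))
            out.close()
    return sorted(handles)
```

Calling `split_by_top_level_tag('filename.xml')` would produce files in the shape asked for in the question, while passing something like `wrapper='objects'` gives each file an outermost element. Because the output files stay open for the whole run, this sketch does not depend on the groups being adjacent, but it will be considerably slower than a streaming XSLT processor on a file of this size.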
Upvotes: 2