kilar
kilar

Reputation: 95

How to parse huge XML with Python and XSLT file iteratively and write to CSV

I'm unable to flatten and transform an XML to CSV file using XSLT, when dealing with large XML files.

Currently, I'm parsing a nested XML file with lxml using a XSL file to flatten the output and then I write the output to a CSV file.

My XML looks something like this:

<root>
    <level1>
        <level2>
            <topid>1</topid>
            <level3>
                <subtopid>1</topid>
                <level4>
                    <subid>1</id>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</id>
                    <descr>test2</descr>
                </level4>
                ...
            </level3>
            ...
        </level2>
    </level1>
</root>

I want to end up with the following CSV file:

topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
....

My XSLT:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" use-character-maps="map"/>
<xsl:character-map name="map">
    <xsl:output-character character="," string=" "/>
</xsl:character-map>

<xsl:strip-space elements="*"/>
<xsl:variable name="delimiter" select="','"/>
<xsl:variable name="newline" select="'&#xd;'" />

<xsl:template match="/root">
    <xsl:text>topid,subtopid,subid,descr</xsl:text>
    <xsl:value-of select="$newline" />

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="ancestor::root/level1/level2/topid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="ancestor::root/level1/level2/level3/subtopid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>
</xsl:template>

My Python code:

import lxml.etree as ET

xsltfile = ET.XSLT(ET.parse('transactions.xsl'))
xmlfile  = ET.parse('myxmlfile.xml')
output   = xsltfile(xmlfile).write_output('output.csv')

This works great for small files, but now I want to do the same with an XML file of +- 2.5gb. Using etree.parse will load it into memory, which won't work with larger files obviously.

I want to iterate somewhere, so I'm not loading the XML file into memory and write to CSV line for line, while still making use of the XSLT for transforming. I'm using the XSLT file because it's the only way I know (now) how to flatten a nested XML file.

Upvotes: 1

Views: 5227

Answers (3)

ond1
ond1

Reputation: 771

Saxon/C and python can work:

One user has successfully used Boost.Python to interface with the C++ library.

Another user has done the interfacing in a different way: https://github.com/ajelenak/pysaxon

Upvotes: 0

Daniel Haley
Daniel Haley

Reputation: 52858

I'd rather use XSLT 3.0 (or even 2.0!) in Python, but haven't had the time to figure out how to use Saxon/C.

Another option would be to use iterparse().

Example...

XML Input (fixed to be well-formed and to add a second level3 for testing)

<root>
    <level1>
        <level2>
            <topid>1</topid>
            <level3>
                <subtopid>1</subtopid>
                <level4>
                    <subid>1</subid>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</subid>
                    <descr>test2</descr>
                </level4>
            </level3>
            <level3>
                <subtopid>2</subtopid>
                <level4>
                    <subid>1</subid>
                    <descr>test</descr>
                </level4>
                <level4>
                    <subid>2</subid>
                    <descr>test2</descr>
                </level4>
            </level3>
        </level2>
    </level1>
</root>

Python

from lxml import etree
import csv

context = etree.iterparse("test.xml", events=("start", "end"))

fields = ("topid", "subtopid", "subid", "descr")

with open("test.csv", "w", newline="", encoding="utf8") as xml_data_to_csv:

    csv_writer = csv.DictWriter(xml_data_to_csv, fieldnames=fields,
                                delimiter=",", quoting=csv.QUOTE_MINIMAL)

    csv_writer.writeheader()

    topid = None
    subtopid = None
    values = {}

    for event, elem in context:
        tag = elem.tag
        text = elem.text

        if tag == "topid" and text:
            topid = text

        if tag == "subtopid" and text:
            subtopid = text

        if tag == "subid" and text:
            values["subid"] = text

        if tag == "descr" and text:
            values["descr"] = text

        if event == "start" and tag == "level4":
            # Build a dict containing all of the "fields" with default values of "Unknown".
            values = {key: "Unknown" for key in fields}

        if event == "end" and tag == "level4":
            values["topid"] = topid
            values["subtopid"] = subtopid
            csv_writer.writerow(values)

        elem.clear()

CSV Output

topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
1,2,1,test
1,2,2,test2

Upvotes: 3

Michael Kay
Michael Kay

Reputation: 163322

One possibility is to use XSLT 3.0 streaming. There are two challenges here:

(a) making your code streamable. We can't judge how difficult that is without seeing the stylesheet code.

(b) installing and running a streaming XSLT 3.0 processor. This depends how locked in to the Python environment you are. If it has to be done in Python, you could try installing Saxon/C. The alternative is to call out to a different environment in which case you have more options, for example you could run Saxon-EE on Java.

LATER

Looking at the code you have posted, it's rather strange

<xsl:for-each select="level1/level2/level3/level4">
    <xsl:value-of select="ancestor::root/level1/level2/topid" />

I suspect you want to output the topid of the "current" level2 element, but that's not what this is doing (in XSLT 1.0 it will print the value of the first level2/topic, in XSLT 2.0+ is will print the values of all the level2/topic elements. I suspect you really want something like this:

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="ancestor::level2/topid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="ancestor::level3/subtopid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>

That's almost streamable, but not quite. Streaming doesn't allow you to go back to the topid and subtopid elements. The easiest way to make it streamable might be to save the most recent values of these elements in accumulators:

<xsl:accumulator name="topid" as="xs:string" initial-value="''">
  <xsl:accumulator-rule match="topid/text()" select="string(.)"/>
</xsl:accumulator>

<xsl:accumulator name="subtopid" as="xs:string" initial-value="''">
  <xsl:accumulator-rule match="subtopid/text()" select="string(.)"/>
</xsl:accumulator>

and then access the values as:

    <xsl:for-each select="level1/level2/level3/level4">
        <xsl:value-of select="accumulator-before('topid')" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="accumulator-before('subtopid')" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="subid" />
        <xsl:value-of select="$delimiter" />
        <xsl:value-of select="descr" />
        <xsl:value-of select="$newline" />
    </xsl:for-each>

Upvotes: 2

Related Questions