Reputation: 95
I'm unable to flatten and transform an XML to CSV file using XSLT, when dealing with large XML files.
Currently, I'm parsing a nested XML file with lxml
using a XSL file to flatten the output and then I write the output to a CSV file.
My XML looks something like this:
<root>
<level1>
<level2>
<topid>1</topid>
<level3>
<subtopid>1</topid>
<level4>
<subid>1</id>
<descr>test</descr>
</level4>
<level4>
<subid>2</id>
<descr>test2</descr>
</level4>
...
</level3>
...
</level2>
</level1>
</root>
I want to end up with the following CSV file:
topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
....
My XSLT:
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="utf-8" use-character-maps="map"/>
<xsl:character-map name="map">
<xsl:output-character character="," string=" "/>
</xsl:character-map>
<xsl:strip-space elements="*"/>
<xsl:variable name="delimiter" select="','"/>
<xsl:variable name="newline" select="'
'" />
<xsl:template match="/root">
<xsl:text>topid,subtopid,subid,descr</xsl:text>
<xsl:value-of select="$newline" />
<xsl:for-each select="level1/level2/level3/level4">
<xsl:value-of select="ancestor::root/level1/level2/topid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="ancestor::root/level1/level2/level3/subtopid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="subid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="descr" />
<xsl:value-of select="$newline" />
</xsl:for-each>
</xsl:template>
My Python code:
import lxml.etree as ET
xsltfile = ET.XSLT(ET.parse('transactions.xsl'))
xmlfile = ET.parse('myxmlfile.xml')
output = xsltfile(xmlfile).write_output('output.csv')
This works great for small files, but now I want to do the same with an XML file of +- 2.5gb. Using etree.parse will load it into memory, which won't work with larger files obviously.
I want to iterate somewhere, so I'm not loading the XML file into memory and write to CSV line for line, while still making use of the XSLT for transforming. I'm using the XSLT file because it's the only way I know (now) how to flatten a nested XML file.
Upvotes: 1
Views: 5227
Reputation: 771
Saxon/C and python can work:
One user has successfully used Boost.Python to interface with the C++ library.
Another user has done the interfacing in a different way: https://github.com/ajelenak/pysaxon
Upvotes: 0
Reputation: 52858
I'd rather use XSLT 3.0 (or even 2.0!) in Python, but haven't had the time to figure out how to use Saxon/C.
Another option would be to use iterparse()
.
Example...
XML Input (fixed to be well-formed and to add a second level3
for testing)
<root>
<level1>
<level2>
<topid>1</topid>
<level3>
<subtopid>1</subtopid>
<level4>
<subid>1</subid>
<descr>test</descr>
</level4>
<level4>
<subid>2</subid>
<descr>test2</descr>
</level4>
</level3>
<level3>
<subtopid>2</subtopid>
<level4>
<subid>1</subid>
<descr>test</descr>
</level4>
<level4>
<subid>2</subid>
<descr>test2</descr>
</level4>
</level3>
</level2>
</level1>
</root>
Python
from lxml import etree
import csv
context = etree.iterparse("test.xml", events=("start", "end"))
fields = ("topid", "subtopid", "subid", "descr")
with open("test.csv", "w", newline="", encoding="utf8") as xml_data_to_csv:
csv_writer = csv.DictWriter(xml_data_to_csv, fieldnames=fields,
delimiter=",", quoting=csv.QUOTE_MINIMAL)
csv_writer.writeheader()
topid = None
subtopid = None
values = {}
for event, elem in context:
tag = elem.tag
text = elem.text
if tag == "topid" and text:
topid = text
if tag == "subtopid" and text:
subtopid = text
if tag == "subid" and text:
values["subid"] = text
if tag == "descr" and text:
values["descr"] = text
if event == "start" and tag == "level4":
# Build a dict containing all of the "fields" with default values of "Unknown".
values = {key: "Unknown" for key in fields}
if event == "end" and tag == "level4":
values["topid"] = topid
values["subtopid"] = subtopid
csv_writer.writerow(values)
elem.clear()
CSV Output
topid,subtopid,subid,descr
1,1,1,test
1,1,2,test2
1,2,1,test
1,2,2,test2
Upvotes: 3
Reputation: 163322
One possibility is to use XSLT 3.0 streaming. There are two challenges here:
(a) making your code streamable. We can't judge how difficult that is without seeing the stylesheet code.
(b) installing and running a streaming XSLT 3.0 processor. This depends how locked in to the Python environment you are. If it has to be done in Python, you could try installing Saxon/C. The alternative is to call out to a different environment in which case you have more options, for example you could run Saxon-EE on Java.
LATER
Looking at the code you have posted, it's rather strange
<xsl:for-each select="level1/level2/level3/level4">
<xsl:value-of select="ancestor::root/level1/level2/topid" />
I suspect you want to output the topid
of the "current" level2
element, but that's not what this is doing (in XSLT 1.0 it will print the value of the first level2/topic
, in XSLT 2.0+ is will print the values of all the level2/topic
elements. I suspect you really want something like this:
<xsl:for-each select="level1/level2/level3/level4">
<xsl:value-of select="ancestor::level2/topid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="ancestor::level3/subtopid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="subid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="descr" />
<xsl:value-of select="$newline" />
</xsl:for-each>
That's almost streamable, but not quite. Streaming doesn't allow you to go back to the topid and subtopid elements. The easiest way to make it streamable might be to save the most recent values of these elements in accumulators:
<xsl:accumulator name="topid" as="xs:string" initial-value="''">
<xsl:accumulator-rule match="topid/text()" select="string(.)"/>
</xsl:accumulator>
<xsl:accumulator name="subtopid" as="xs:string" initial-value="''">
<xsl:accumulator-rule match="subtopid/text()" select="string(.)"/>
</xsl:accumulator>
and then access the values as:
<xsl:for-each select="level1/level2/level3/level4">
<xsl:value-of select="accumulator-before('topid')" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="accumulator-before('subtopid')" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="subid" />
<xsl:value-of select="$delimiter" />
<xsl:value-of select="descr" />
<xsl:value-of select="$newline" />
</xsl:for-each>
Upvotes: 2