Reputation: 27
I have a repository of circa 600K XML files and need to audit all the possible values of <country name="Country_Name">.
The files are generally small, 80 KB maximum filesize.
<country name="Country_Name"> is held in the metadata section of the XML, near the top before the main body of the article. Would using a streaming approach vastly improve the performance of extracting the country value, since streaming could stop as soon as that element is matched by the XPath?
Is this the right approach? If so, I'm running the Professional Edition of Saxon and would like to know whether streaming lots of small documents warrants upgrading to Enterprise Edition.
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:mode streamable="yes"/>

  <xsl:template match="/">
    <xsl:for-each-stream select="doc('file:///C:/temp/*.xml')">
      <xsl:value-of select="//country"/>
      <!-- Replace the following with your code to dump the current file -->
      <xsl:message select="concat('Processing ', document-uri(.))"/>
    </xsl:for-each-stream>
  </xsl:template>
</xsl:stylesheet>
Expected result: a message for each file printed to the screen.
Upvotes: 0
Views: 252
Reputation: 167716
I think you can use e.g.
<xsl:template name="xsl:initial-template">
  <xsl:for-each select="uri-collection('file:///C:/SomePath/SomeFolder?select=*.xml')">
    <xsl:message expand-text="yes">Processing file {.}</xsl:message>
    <xsl:source-document href="{.}" streamable="yes">
      <xsl:value-of select="descendant::country[@name = 'Country_Name'][1]"/>
    </xsl:source-document>
  </xsl:for-each>
</xsl:template>
To use streaming you need Saxon-EE; you might want to request a trial license from Saxonica if you want to test whether that approach performs better than traditional, non-streaming XSLT 3.
I am not sure, however, whether the posted code stops processing a file after finding that first element; that would need testing. There is an xsl:iterate/xsl:break combination for which Saxon-EE, I think, does an early exit.
Following the documentation at https://www.saxonica.com/html/documentation11/streaming/partial-reading.html I have tried code along the lines of
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all"
                version="3.0">

  <xsl:param name="uri" as="xs:string" expand-text="no">file:///C:/SomePath/SomeFolder?select=*.xml</xsl:param>
  <xsl:param name="items-to-select" as="xs:integer" select="1"/>

  <xsl:template name="xsl:initial-template">
    <xsl:for-each select="uri-collection($uri)">
      <xsl:message expand-text="yes">Processing file {.}</xsl:message>
      <xsl:source-document href="{.}" streamable="yes">
        <xsl:iterate select="outermost(descendant::country[@name = 'Country_Name'])">
          <xsl:value-of select="."/>
          <xsl:if test="position() eq $items-to-select">
            <xsl:break/>
          </xsl:if>
        </xsl:iterate>
      </xsl:source-document>
    </xsl:for-each>
  </xsl:template>

  <xsl:output method="text" item-separator=" "/>

</xsl:stylesheet>
and it passes the streamability analysis and also indicates an early exit. With 11.5 EE the early exit was reported only once when I processed three files without the -t option, but three times when I used the -t option with three files. I have now also tested with 12.0 EE, and it outputs SXQP0001 "The input file has not been read to completion" for each processed file even without the -t option.
So that would mean that, if the element occurs early in all or most of your documents, parsing is abandoned once the element has been found.
In some simple tests it looks as if a streamed, early-exit solution run through Saxon 12.0 EE is faster than the unstreamed one, provided the files are sufficiently large and the element you are looking for occurs early enough in each document.
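For reference, an invocation along these lines runs the stylesheet above (the jar and stylesheet file names are only placeholders; adjust them and the collection URI to your installation), with -t added when you want the timing output:

java -cp saxon-ee-12.0.jar net.sf.saxon.Transform -t -it -xsl:extract-country.xsl uri="file:///C:/SomePath/SomeFolder?select=*.xml"

Here -it with no name invokes the named xsl:initial-template, and the trailing uri=... overrides the stylesheet's $uri parameter.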
Upvotes: 0
Reputation: 163595
Normally I advise people that streaming doesn't make execution faster; its purpose is to reduce memory usage. It's possible that the "early exit" could make a difference, but only measurement will tell. With documents under 100K, I believe that quite a lot of the XML parsing cost is initialisation of the parse, so an early exit might not save as much as one hopes.
With this kind of workload the biggest gain from using Saxon-EE might come from parallel processing of the input documents. The collection() function under Saxon-EE does parallel parsing automatically, but combining this with streaming might take a bit more thought.
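As a rough sketch (untested against your data; the collection URI is a placeholder, and you may want //country/@name rather than the element value, depending on what you are auditing), an unstreamed audit that lets collection() do the parsing and reports each distinct country value once could look like this:

<xsl:stylesheet version="3.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="#all">

  <xsl:output method="text"/>

  <!-- Placeholder collection URI; adjust the path and file pattern to your repository. -->
  <xsl:param name="collection-uri" as="xs:string"
             select="'file:///C:/SomePath/SomeFolder?select=*.xml'"/>

  <xsl:template name="xsl:initial-template">
    <!-- Unstreamed audit: collection() parses every document (Saxon-EE does this in
         parallel) and distinct-values() reports each country value only once. -->
    <xsl:for-each select="distinct-values(collection($collection-uri)//country[@name = 'Country_Name'])">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>

Whether this beats the streamed, early-exit version then comes down to measuring both against a representative sample of the 600K files.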
Upvotes: 0