Reputation: 11
I have several XML files that all have the same structure. I need to find all the distinct values that occur in a given element and count the occurrences of each distinct value.
What is the best way of doing this?
I’m using the XPath/XQuery Builder in Oxygen (Saxon-HE XQuery 9.9.1.7).
What I have so far is:
let $collection := collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes')
for $val in distinct-values($collection//Element1/Element2)
for $doc in $collection
for $c in count(index-of($doc//Element1/Element2, $val))
order by document-uri($doc)
where $c>0
return ("
", tokenize(document-uri($doc),'/')[last()], "______________",$val, "-", $c)
This sort of works, but it takes way, way too long. I’m assuming there is a better way to do it?
Upvotes: 1
Views: 84
Reputation: 167716
Test whether using grouping gives better performance, e.g.:
let $collection := collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes')
for $el2 in $collection//Element1/Element2
group by $val := $el2, $uri := document-uri(root($el2))
order by $uri
return (tokenize($uri, '/')[last()] || ': ' || $val || '; ' || count($el2))
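To see what the group by clause does in isolation, here is a minimal sketch on inline sample data. The $sample element and its values are made up for illustration; only the Element1/Element2 names come from the question. After grouping, $el2 no longer refers to a single element but to all elements in the current group, which is why count($el2) gives the number of occurrences.

```xquery
xquery version "3.1";

(: Hypothetical inline sample standing in for one of the XML files :)
let $sample :=
  <root>
    <Element1>
      <Element2>a</Element2>
      <Element2>b</Element2>
      <Element2>a</Element2>
    </Element1>
  </root>
for $el2 in $sample//Element1/Element2
(: one group per distinct string value :)
group by $val := string($el2)
order by $val
(: $el2 is now the sequence of all Element2 elements in the group :)
return $val || ': ' || count($el2)
```

With this sample the query should return the two strings "a: 2" and "b: 1". Each element is visited once, whereas the distinct-values/index-of approach scans every document again for every distinct value.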
I generated some arbitrary sample data and ran some tests (with Saxon 12.5, though). Interestingly, in those tests an XSLT 3 based grouping
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
exclude-result-prefixes="xs math"
version="3.0">
<xsl:output method="text" item-separator="&#10;"/>
<xsl:template name="xsl:initial-template">
<xsl:for-each-group select="collection('.?select=*.xml;recurse=yes')//element1/element2" composite="yes" group-by="string(), base-uri(root(.))">
<xsl:sort select="current-grouping-key()[2]"/>
<xsl:sequence select="tokenize(current-grouping-key()[2], '/')[last()] || ': ' || current-grouping-key()[1] || '; ' || count(current-group())"/>
</xsl:for-each-group>
</xsl:template>
</xsl:stylesheet>
seems faster than the XQuery grouping I posted in this answer. And my XQuery grouping seems slower than your XQuery distinct-values attempt. Your mileage may vary.
As you seem to only want the count for each document in the collection, I have tried to use uri-collection instead of collection; then you can also use multithreading with Saxon EE and saxon:threads, and that speeds things up considerably (Saxon EE 12: 77 ms vs. HE: 119 ms):
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:math="http://www.w3.org/2005/xpath-functions/math"
xmlns:saxon="http://saxon.sf.net/"
exclude-result-prefixes="#all"
version="3.0">
<xsl:param name="collection-uri-param" as="xs:string" expand-text="no">.?select=*.xml;recurse=yes</xsl:param>
<xsl:variable name="uri-collection" select="uri-collection($collection-uri-param) => sort()"/>
<xsl:output method="text" item-separator="&#10;"/>
<xsl:template name="xsl:initial-template">
<xsl:for-each select="$uri-collection" saxon:threads="8">
<xsl:variable name="uri" select="."/>
<xsl:for-each-group select="doc(.)//element1/element2" group-by="string()">
<xsl:sequence select="tokenize($uri, '/')[last()] || ': ' || current-grouping-key() || '; ' || count(current-group())"/>
</xsl:for-each-group>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
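The same per-document idea can probably also be written in plain XQuery, which may be more convenient in Oxygen's XQuery Builder. The following is a sketch only: it assumes the folder URI and the Element1/Element2 names from the question, it does not use saxon:threads (so it runs single-threaded), and it is untimed, so whether it beats the other variants would need to be measured.

```xquery
xquery version "3.1";

(: Assumption: folder URI taken from the question; adjust to your data :)
for $uri in sort(uri-collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes'))
(: file name is the last path segment of the document URI :)
let $file := tokenize($uri, '/')[last()]
return
  (: load and group one document at a time :)
  for $el2 in doc($uri)//Element1/Element2
  group by $val := string($el2)
  return $file || ': ' || $val || '; ' || count($el2)
```

Because each document is loaded and grouped on its own, the grouping key only needs the element value, not the document URI.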
Upvotes: 0