SCH

Reputation: 11

Count distinct values in multiple XML-files with XQuery/XPath

I have several XML-files, which all have the same structure. I need to find all the distinct values that occur in each element and count each distinct occurrence.

What is the best way of doing this?

I’m using Oxygen, XPath/XQuery Builder (Saxon-HE XQuery 9.9.1.7)

What I have so far is:

let $collection := collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes')
for $val in distinct-values($collection//Element1/Element2)
for $doc in $collection
for $c in count(index-of($doc//Element1/Element2, $val))
order by document-uri($doc)
where $c>0
return ("
", tokenize(document-uri($doc),'/')[last()], "______________",$val, "-", $c)

This sort of works… but it takes way way too long. I’m assuming there is a better way to do it?

Upvotes: 1

Views: 84

Answers (1)

Martin Honnen

Reputation: 167716

Test whether grouping gives better performance, e.g.:

let $collection := collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes')
for $el2 in $collection//Element1/Element2
group by $val := $el2, $uri := document-uri(root($el2))
order by $uri
return (tokenize($uri, '/')[last()] || ': ' || $val || '; ' || count($el2))

I generated some arbitrary sample data and ran some tests (with Saxon 12.5, though); interestingly, in those tests an XSLT 3 based grouping

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    exclude-result-prefixes="xs math"
    version="3.0">
    
    <xsl:output method="text" item-separator="&#10;"/>
    
    <xsl:template name="xsl:initial-template">
        <xsl:for-each-group select="collection('.?select=*.xml;recurse=yes')//element1/element2" composite="yes" group-by="string(), base-uri(root(.))">
            <xsl:sort select="current-grouping-key()[2]"/>
            <xsl:sequence select="tokenize(current-grouping-key()[2], '/')[last()] || ': ' || current-grouping-key()[1] || '; ' || count(current-group())"/>
        </xsl:for-each-group>
    </xsl:template>
    
</xsl:stylesheet>

seems faster than the XQuery grouping I posted in this answer. And my XQuery grouping seems slower than your XQuery distinct-values attempt. Your mileage may vary.

As you seem to want the count only per document in the collection, I have tried using uri-collection instead of collection. With that approach you can also use multithreading with Saxon EE and saxon:threads, which speeds things up considerably (Saxon EE 12: 77 ms vs. HE: 119 ms):

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:math="http://www.w3.org/2005/xpath-functions/math"
    xmlns:saxon="http://saxon.sf.net/"
    exclude-result-prefixes="#all"
    version="3.0">
    
    <xsl:param name="collection-uri-param" as="xs:string" expand-text="no">.?select=*.xml;recurse=yes</xsl:param>
    
    <xsl:variable name="uri-collection" select="uri-collection($collection-uri-param) => sort()"/>
    
    <xsl:output method="text" item-separator="&#10;"/>
    
    <xsl:template name="xsl:initial-template">
        <xsl:for-each select="$uri-collection" saxon:threads="8">
            <xsl:variable name="uri" select="."/>
            <xsl:for-each-group select="doc(.)//element1/element2" group-by="string()">
                <xsl:sequence select="tokenize($uri, '/')[last()] || ': ' || current-grouping-key() || '; ' || count(current-group())"/>
            </xsl:for-each-group>            
        </xsl:for-each>
    </xsl:template>
    
</xsl:stylesheet>
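
The same per-document idea can also be sketched in XQuery, which may be handier inside the Oxygen XPath/XQuery Builder. This is an untested and untimed sketch: it assumes the folder URI and the Element1/Element2 names from the question, and it uses a nested FLWOR expression so that each document is loaded and grouped on its own.

```xquery
xquery version "3.1";

(: Assumption: folder URI and element names are taken from the question. :)
for $uri in sort(uri-collection('file:///C:/MY_FOLDER?select=*.xml;recurse=yes'))
return
    (: The inner FLWOR groups the values of one document only. :)
    for $el2 in doc($uri)//Element1/Element2
    group by $val := string($el2)
    return tokenize($uri, '/')[last()] || ': ' || $val || '; ' || count($el2)
```

As written this runs single-threaded, so whether it beats the other variants on your data is something you would have to measure.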

Upvotes: 0
