fildpauz

Reputation: 35

generate-id() too slow for large document

I have a large XML document containing annotated speech transcripts. The following is a short fragment.

<?xml version="1.0" encoding="UTF-8"?>
<U>
    <A/>
    <C type="start" id="cb01s"/>
    <P/>
    <T>a</T>
    <T>woman</T>
    <P/>
    <T>took</T>
    <T>off</T>
    <T>the</T>
    <T>train</T>
    <C type="end" id="cb02e"/>
    <P/>
    <T>but</T>
    <P/>
    <F/>
    <RT>
        <O>
            <C type="start" id="cb03s"/>
            <T>her</T>
            <T>bag</T>
            <P/>
            <T>are</T>
        </O>
        <P/>
        <E>
            <C type="start" id="cb04s"/>
            <T>her</T>
            <T>bag</T>
            <T>are</T>
        </E>
    </RT>
    <P/>
    <T>still</T>
    <P/>
    <T>in</T>
    <T>the</T>
    <T>train</T>
    <C type="end" id="cb05e"/>
    <PC>.</PC>
</U>

The basic task I need to do is to get the number of <T> nodes between certain pairs of <C> nodes. I've used the following stylesheet fragment to do this (illustrating with one specific pair of <C> nodes).

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="U">
        <xsl:variable name="start-node" select="descendant::C[@id = 'cb01s']"/>
        <xsl:variable name="end-node" select="descendant::C[@id = 'cb02e']"/>
        <xsl:text>Result: </xsl:text>
        <xsl:value-of select="count($start-node/following::T[following::C[generate-id(.) = generate-id($end-node)]])"/>
    </xsl:template>

</xsl:stylesheet>

This works fine on a short XML fragment like the one above and gives the correct result: Result: 6.

However, the actual XML document contains tens of thousands of <C> nodes and even more <T> nodes, so when I run the stylesheet on it, the result comes back very slowly. (It would probably take days to finish completely.) I suppose the problem is that on each run of the <xsl:value-of... line, the processor (Saxon) checks all <T> nodes and generates IDs for <C> nodes many times over, and that slows everything down.

Is there a way to speed up the process while still using generate-id()? Or do I need to get the number of <T> nodes with some alternate approach?

Upvotes: 1

Views: 138

Answers (1)

John Bollinger

Reputation: 181199

You do not need generate-id() just to avoid matching <C> elements intervening between the start and end nodes. You are matching <C> elements by their id attributes in the first place, and I see no reason not to use that more directly. For example,

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="UTF-8"/>

    <xsl:template match="U">
        <!-- String literals need inner quotes; a bare name would select child elements instead. -->
        <xsl:variable name="start-id" select="'cb01s'"/>
        <xsl:variable name="end-id" select="'cb02e'"/>

        <xsl:text>Result: </xsl:text>
        <xsl:value-of select="count(descendant::C[@id = $start-id]/following::T[following::C[@id = $end-id][1]])"/>
    </xsl:template>

</xsl:stylesheet>

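The [1] position predicate is optional: the enclosing predicate only tests whether a matching <C> element exists, so removing it does not change the count. If the <C> element @ids are unique in the document, it selects nothing that the id test alone would not.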

If generate-id() is indeed the primary cause of your performance problem, then avoiding it altogether ought to provide a big boost.
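
If the nested following:: tests turn out to be the real cost rather than generate-id() itself, one alternative worth trying is to count by subtraction. The sketch below has not been run against the full document and rests on assumptions not stated in the question: the id values are unique, the start <C> precedes the end <C> in document order, and no <T> contains or is contained by a <C>. The key name c-by-id is made up for illustration. Note also that key() searches the whole document rather than only the current <U>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output method="text" encoding="UTF-8"/>

    <!-- Hypothetical key: index every <C> by its id attribute
         so each lookup avoids a scan of the document. -->
    <xsl:key name="c-by-id" match="C" use="@id"/>

    <xsl:template match="U">
        <!-- Assumes each id occurs exactly once in the document. -->
        <xsl:variable name="start-node" select="key('c-by-id', 'cb01s')"/>
        <xsl:variable name="end-node" select="key('c-by-id', 'cb02e')"/>

        <xsl:text>Result: </xsl:text>
        <!-- <T> elements before the end marker, minus those before the
             start marker, leaves the <T> elements between the two.
             Assumes the start marker comes first in document order. -->
        <xsl:value-of select="count($end-node/preceding::T) - count($start-node/preceding::T)"/>
    </xsl:template>

</xsl:stylesheet>
```

For the fragment in the question this should again print Result: 6, since six <T> elements precede cb02e and none precede cb01s. Each count walks the preceding axis once, so the work per pair grows roughly with the size of the document instead of with its square, though it may still be slow if there are very many pairs to process.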

Upvotes: 1
