Reputation: 35
I have a large XML document containing annotated speech transcripts. The following is a short fragment.
<?xml version="1.0" encoding="UTF-8"?>
<U>
<A/>
<C type="start" id="cb01s"/>
<P/>
<T>a</T>
<T>woman</T>
<P/>
<T>took</T>
<T>off</T>
<T>the</T>
<T>train</T>
<C type="end" id="cb02e"/>
<P/>
<T>but</T>
<P/>
<F/>
<RT>
<O>
<C type="start" id="cb03s"/>
<T>her</T>
<T>bag</T>
<P/>
<T>are</T>
</O>
<P/>
<E>
<C type="start" id="cb04s"/>
<T>her</T>
<T>bag</T>
<T>are</T>
</E>
</RT>
<P/>
<T>still</T>
<P/>
<T>in</T>
<T>the</T>
<T>train</T>
<C type="end" id="cb05e"/>
<PC>.</PC>
</U>
The basic task I need to do is to get the number of <T> nodes between certain pairs of <C> nodes. I've used the following stylesheet fragment to do this (illustrating with one specific pair of <C> nodes).
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="U">
<xsl:variable name="start-node" select="descendant::C[@id = 'cb01s']"/>
<xsl:variable name="end-node" select="descendant::C[@id = 'cb02e']"/>
<xsl:text>Result: </xsl:text>
<xsl:value-of select="count($start-node/following::T[following::C[generate-id(.) = generate-id($end-node)]])"/>
</xsl:template>
</xsl:stylesheet>
This works fine on a short XML fragment like the one above and gives the correct result: Result: 6.
However, the actual XML document contains tens of thousands of <C> nodes and even more <T> nodes, so when I run the stylesheet on it the result comes back very slowly. (It would probably take days to finish.) I suppose the problem is that on each evaluation of the <xsl:value-of... line, the processor (Saxon) checks all <T> nodes and generates IDs for <C> nodes multiple times (i.e., exponentially), and that slows everything down.
Is there a way to speed up the process while still using generate-id()? Or do I need to get the number of <T> nodes with some alternate approach?
Upvotes: 1
Views: 138
Reputation: 181199
You do not need generate-id() just to avoid matching <C> elements intervening between the start and end nodes. You are matching <C> elements by their id attributes in the first place, and I see no reason not to use that more directly. For example,
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" encoding="UTF-8"/>
<xsl:template match="U">
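<!-- The inner single quotes make these string literals; without them, select would look for child elements named cb01s / cb02e. -->
<xsl:variable name="start-id" select="'cb01s'"/>
<xsl:variable name="end-id" select="'cb02e'"/>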
<xsl:text>Result: </xsl:text>
<xsl:value-of select="count(descendant::C[@id = $start-id]/following::T[following::C[@id = $end-id][1]])"/>
</xsl:template>
</xsl:stylesheet>
You can simplify that by removing the [1] position predicate if you can rely on the <C> element @id values to be unique in the document.
If generate-id() is indeed the primary cause of your performance problem, then avoiding it altogether ought to provide a big boost.
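If it turns out that the nested following:: axes, rather than generate-id(), dominate the run time, a different technique may be worth trying: count the <T> elements that precede each marker and subtract. This is only a sketch, and it swaps in a different approach from the one above. It assumes that the start <C> comes before the end <C> in document order, that <T> and <C> elements are never nested inside one another (as in the sample), and that the @id values are unique. The key name c-by-id is a name I made up for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- Hypothetical key name: indexes every <C> by its @id (assumed unique). -->
  <xsl:key name="c-by-id" match="C" use="@id"/>

  <xsl:template match="U">
    <xsl:variable name="start-node" select="key('c-by-id', 'cb01s')"/>
    <xsl:variable name="end-node" select="key('c-by-id', 'cb02e')"/>
    <xsl:text>Result: </xsl:text>
    <!-- <T> elements before the end marker, minus <T> elements before the start marker.
         Assumes the start marker precedes the end marker in document order. -->
    <xsl:value-of select="count($end-node/preceding::T) - count($start-node/preceding::T)"/>
  </xsl:template>
</xsl:stylesheet>
```

On the sample fragment this prints Result: 6, the same as the original expression. Each pair then needs two single-axis counts instead of a following:: test nested inside another following:: step, though how much that helps in practice depends on how the processor evaluates the preceding axis.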
Upvotes: 1