Scar
Scar

Reputation: 148

XPath How to optimize performance over "preceding" axis?

I am using XSLT to transform XML files and this XPath is a very small part of it. The main object is a performance issue. First, I will describe the context:
Part of the transformation is a complex grouping operation, used to group up a sequence of similar elements, in the order they appear. This is a small sample from the data:

<!-- potentially a lot more data-->
<MeaningDefBlock>
    <!-- potentially a lot more data-->
    <MeaningSegment>
        <Meaning>
            <value> or </value>
        </Meaning>
    </MeaningSegment>
    <MeaningSegment>
        <MeaningInsert>
            <OpenBracket>
                <value>(</value>
            </OpenBracket>
            <Meaning>
                <value>ex.: </value>
            </Meaning>
            <IllustrationInsert>
                <value>ita, lics</value>
            </IllustrationInsert>
            <ClosedBracket>
                <value>)</value>
            </ClosedBracket>
        </MeaningInsert>
    </MeaningSegment>
    <!-- potentially a lot more data-->
</MeaningDefBlock>
<!-- potentially a lot more data-->

There are only parent elements (ex.: MeaningInsert) and elements that only contain a value element, which contains text (ex.: IllustrationInsert). The text from the input gets grouped into elements that have such text segments: or (ex.:, ita, lics and ) (in this case, the "ita, lics" segment separates the groups that would otherwise be all in one). The main point is that elements from different levels can be grouped. XPath is used to identify groups via previous segments and keyed in the XSL. The whole key is very complicated and not the object of the question (but I still provide it for context):
<xsl:key name="leavesInGroupL4" match="MeaningSegment//*[value]" use="generate-id(((preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[not(boolean(self::IllustrationInsert|self::LatinName)=boolean(current()/self::IllustrationInsert|current()/self::LatinName))]|ancestor-or-self::MeaningDefBlock)[last()])"/>
The important part being:
(preceding-sibling::*[value]|ancestor-or-self::MeaningSegment/preceding-sibling::MeaningSegment//*[value])[...]
From the context of an element with a value child (like Meaning or OpenBracket), this XPath selects the previous siblings and all the elements with values from the preceding siblings of the parent/ancestor MeaningSegment. In practice, it basically selects all the text that came before it (or, rather, the grandparent of the text itself)

I have later realized that there might be even further complications with layers and differing depth of the elements with values. I might need to select all such preceding elements regardless of their parent and siblings but still in the same block. I have substituted "the important part" with a somewhat simpler XPath expression:
preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock) = generate-id(current()/ancestor-or-self::MeaningDefBlock)]
This only checks that it's in the same block and it works! It successfully selects the preceding segments of text in the block, even if elements with values and parent elements are mixed together. Example input fragment:

...
<OpenBracket>
    <value>(</value>
</OpenBracket>
<SomeParentElement>
    <LatinName>
        <value>also italics</value>
    </LatinName>
</SomeParentElement>
<ClosedBracket>
    <value>)</value>
</ClosedBracket>
...

This is not something the first approach could do because the brackets and the LatinName are not siblings.
However, the new approach with preceding:* is extremely slow! On a real document, the XSL transformation takes up to 5 minutes instead of the usual 3 seconds that the original approach takes (including overhead), which is a 100x increase in time taken. Of course, that is because preceding checks nearly every node in the whole document when it is executed (a lot of times). The document has a lot of MeaningDefBlock blocks (nearly 2000), each with a couple segments of text (usually single-digit) and a bunch of other straight-forward elements/nodes unrelated to said text (usually in the low hundreds, each block). Quite easy to see how this all adds up to preceding trashing performance over preceding-sibling.

I was wondering if this could be optimized somehow. In XSL, keys have greatly improved performance multiple times in our project but I'm not sure if preceding and keys can be combined or if the XPath needs to be more complex and tailored to my specific case, perhaps enumerating the elements it should look at (and hopefully ignoring everything else).
Since the input will currently always work with the first approach, I have conceded and rolled back the change (and would probably rather take the 5 min hit every time than trying optimization myself).
I use XSLT 1.0 and XPath 1.0

Upvotes: 0

Views: 416

Answers (1)

Michael Kay
Michael Kay

Reputation: 163575

I guess you've probably already worked out that

    preceding::*[value and generate-id(ancestor-or-self::MeaningDefBlock)
 = generate-id(current()/ancestor-or-self::MeaningDefBlock)]

is going to search back to the beginning of the document; it's not smart enough to know that it only needs to search within the containing meaningDefBlock element.

One answer to that would be to change it to something like this:

ancestor-or-self::MeaningDefBlock//*[value][. << current()]

The << operator requires XPath 2.0 and for a problem as complex as this, you really ought to consider moving forwards. However you can simulate the operator in 1.0 with an expression like generate-id(($A|$B)[1]) = generate-id($A).

There's no guarantee this will be faster, but unlike your existing solution it should be independent of how many MeaningDefBlock elements there are in the document.

Upvotes: 1

Related Questions