jbrehr

Reputation: 815

XQuery - optimizing inefficient query strategy (in eXist-db)

Environment: eXist-db 4.4 / XQuery 3.1

I have hundreds of TEI XML documents in which the named entities persName and placeName are encoded. The documents are in

 collection("db/fooapp/data")

Each instance of persName and placeName has an attribute @nymRef, which contains a single value that refers to an xml:id in one of two master documents:

 db/fooapp/data/codes_persons.xml

 db/fooapp/data/codes_places.xml

These master documents contain, among other things, the canonical name of each person or place.

I frequently do single lookups for one name, for example:

let $x := "fooplace" (: some @nymRef value :)
let $y := doc("db/fooapp/data/codes_places.xml")//tei:place[@xml:id = $x]//tei:placeName/text()
return $y

Sometimes, though, I need to do this while cycling through huge lists. For example, across all the documents I need to output the id of each seg that has one or more child placeName elements with a @nymRef:

 <seg xml:id="fooref">some text<placeName nymRef="fooplace"/>some text</seg>

The task is to obtain every seg/@xml:id and then look up and output the canonical name for each placeName/@nymRef beneath it. This results in numerous round trips that are really inefficient, but I do not know any other way to do this in eXist-db. The costly round trip happens at let $c, which is evaluated once for every row returned:

let $coll := collection("db/fooapp/data")
for $a in $coll//seg
    for $b in $a//placeName
        (: one lookup in the master document per placeName :)
        let $c := doc("db/fooapp/data/codes_places.xml")//tei:place[@xml:id = $b/data(@nymRef)]//tei:placeName/text()
        return
            <tr>
                <td>{$a/@xml:id}</td>
                <td>{$c}</td>
            </tr>

This can add up to hundreds of round trips for a single table output.

I have no objections to restructuring the task into multiple functions if necessary.

Many thanks in advance.

Upvotes: 0

Answers (1)

duncdrum

Reputation: 733

Please provide a sample input XML and the desired output; otherwise there is no way to rewrite your query. We also need to see your index configuration.

Some general advice, for avoiding roundtrips:

  • First off, see my previous answer to your question on the use of ft:query(). When you write [@xml:id=$b/data(@nymRef)], is eXist using an index, or are you forcing it to do a string comparison on a value that has no index configured?

  • id() is the fastest possible way to look up xml:id values.

  • distinct-values() is your friend: use it to look up each distinct key:value pair only once.

  • Use a single for loop to avoid iterating over the same data multiple times.

  • Whenever possible, go for more restrictive XPath expressions; // probably loads a lot of unnecessary data into memory.

All of these and more can be found in the documentation.
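
Without the sample input I can only guess, but the advice above could be combined roughly as follows. This is a minimal sketch, not a tested rewrite of your query. It assumes that seg and placeName are in the TEI namespace, that placeName is a direct child of seg, and that the master document is reachable at the path given in the question. The variable names ($places, $segs, $names) are mine.

```xquery
xquery version "3.1";

(: Assumption: all elements live in the TEI namespace. :)
declare namespace tei = "http://www.tei-c.org/ns/1.0";

(: Load the master document once instead of once per row. :)
let $places := doc("db/fooapp/data/codes_places.xml")

(: Assumption: placeName is a direct child of seg, so no // is needed here. :)
let $segs := collection("db/fooapp/data")//tei:seg[tei:placeName/@nymRef]

(: Resolve every distinct @nymRef exactly once via id() and cache it in a map. :)
let $names := map:merge(
    for $key in distinct-values($segs/tei:placeName/@nymRef)
    return
        map:entry(
            string($key),
            (: // mirrors the question; narrow it if the structure is known :)
            string-join(id($key, $places)//tei:placeName/string(), "; ")
        )
)

return
    <table>{
        for $seg in $segs
        for $ref in $seg/tei:placeName/@nymRef
        return
            <tr>
                <td>{string($seg/@xml:id)}</td>
                <td>{$names(string($ref))}</td>
            </tr>
    }</table>
```

The master document is only touched once per distinct @nymRef, and building the table rows becomes an in-memory map lookup. Whether this is actually faster for your data depends on the index configuration, so please measure it rather than taking my word for it.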

Upvotes: 1
