jbrehr
jbrehr

Reputation: 815

XQUERY: optimizing a query that touches large collections

I have the following XQUERY running in ExistDB (against XML documents that follow the TEI schema):

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

let $data-collection := "/db/apps/deheresi/resources/documents"
let $people-collection := "/db/apps/deheresi/resources/documents/codes_people.xml"

for $msdoc in collection($data-collection)/tei:TEI[contains(@xml:id,'ms609')]

for $ordinal in $msdoc/tei:text/tei:front//tei:div[@type='registry_ordinal']/replace(@n, '#', '')

for $doctype in $msdoc/tei:text/tei:front//tei:div[@type='doc_type']/replace(@subtype, '#', '')

for $folio in $msdoc/tei:text/tei:front//tei:div[@type='folio']/replace(@n, '#', '')

for $nameref in $msdoc/tei:text/tei:body[1]/tei:p[1]/tei:seg[1]/tei:persName[@role = 'dep']/replace(@nymRef, '#', '') 

for $persname in normalize-space(string-join(doc($people-collection)//tei:person[@xml:id = $nameref]))

return concat('<td>',$ordinal,'</td><td>',$folio,'</td><td>',$doctype,'</td><td>',$persname,'</td>')

Organization of XML documents:

The function does the following:

  1. get the identifying tei:TEI/@xml:id and the tei:persName[@role="dep"]/@nymRef from each xml document

  2. With the tei:persName[@role="dep"]/@nymRef I look up the the name in codes_people.xml/tei:person/xml:id="unique_foo_name"

This all returns the expected results...except it's really, really slow (4 seconds). Obviously I'm testing on a local computer and not a server, but I would like to optimize the queries before testing on more powerful servers.

ADDED PER REQUEST:

ExistDB version : 3.3.0

Sample output (the eventual target is an HTML table)

<td>0001</td><td>1r</td><td>Deposition</td><td>Arnald Garnier</td> 
<td>0002</td><td>1r</td><td>Deposition</td><td>Guilhem de Rosengue</td> 
<td>0003</td><td>1r</td><td>Deposition</td><td>Hugo de Mamiros</td> 
<td>0004</td><td>1r</td><td>Deposition</td><td>P Lapassa senior</td>

Many thanks in advance.

EDIT: I've added more information in a self-response below, and a link to all the files in Dropbox in the comments.

Upvotes: 0

Views: 307

Answers (2)

duncdrum
duncdrum

Reputation: 733

So there are a few problems with your code, that impact performance. The first is the fact that you process strings instead of xml paths. e.g. when using replace() instead of of some/@path[. = 'xyz']. Simply by using fn:id() instead of replace() will cut execution times to under 1 second.

The second is a missing xmlschema namespace declaration in your index config file, not that those indexes are used, since you are forcing exist to process strings instead of xml.

The third is the fact that your xquery code does not return a well-formed xml fragment, always a bad idea for performance reasons.

xquery version "3.1";
declare namespace tei="http://www.tei-c.org/ns/1.0";

declare variable $data-collection := "/db/apps/so-52709411/data";

(:always return a well-formed fragment:)
 <table>{

 let $people-collection := doc($data-collection || "/codes_people.xml")

 let $msdoc := collection($data-collection)//tei:TEI[contains(@xml:id,'ms609')]

 for $n in $msdoc
 let $registry := $n//tei:div[@type='registry_ordinal']/data(@n)
 let $type := $n//tei:div[@type='doc_type']/data(@subtype)
 let $folio := $n//tei:div[@type='folio']/data(@n)
 let $nym := substring-after($n//tei:persName[@role = 'dep']/data(@nymRef), '#') 
 let $persName := $people-collection//id($nym)/tei:persName

 return
<tr>
<td>{$registry}</td>
<td>{$type}</td>
<td>{$folio}</td>
<td>{$persName/string()
}</td>
</tr>

}
 </table>

combined with

<collection xmlns="http://exist-db.org/collection-config/1.0">
<index xmlns:tei="http://www.tei-c.org/ns/1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <range>
       <create qname="tei:persName" type="xs:string"/>
       <create qname="tei:person" type="xs:string"/>
       <create qname="@type" type="xs:string"/>
       <create qname="@role" type="xs:string"/>
       <create qname="@nymRef" type="xs:string"/>
    </range>
</index>
<triggers>
    <trigger class="org.exist.extensions.exquery.restxq.impl.RestXqTrigger"/>
</triggers>
 </collection>

results in indexes actually being usable

/Users/HALmob/pCloud Drive/Screenshots/Screen Shot 2018-10-15 at 01.30.42.png

but the data sample isn't really big enough to make much of a performance impact beyond the rewritten xquery. So even without indexes you should be running in the less then 1s range (depending on memory, hardware etc, YMMV)

you can download a working app running with your code here

Upvotes: 2

jbrehr
jbrehr

Reputation: 815

I've tried to simplify the Xquery by replacing certain for loops with let and concat():

xquery version "3.1";

declare namespace tei="http://www.tei-c.org/ns/1.0";

declare variable $people-collection := doc("/db/apps/deheresi/resources/documents/codes_people.xml");

let $data-collection := "/db/apps/deheresi/resources/documents"

for $msdoc in collection($data-collection)/tei:TEI[contains(@xml:id,'ms609')]

    let $concat1 := concat('<td>',
               $msdoc//tei:div[@type='registry_ordinal']/replace(@n, '#', ''), 
               '</td><td>', 
               $msdoc//tei:div[@type='doc_type']/replace(@subtype, '#', ''), 
               '</td><td>',
               $msdoc//tei:div[@type='folio']/replace(@n, '#', ''),
               '</td><td>')

    (:  obtain the attribute value of persName[@role = 'dep']/@nymRef   :)
    let $nameref := $msdoc//tei:persName[@role = 'dep']/replace(@nymRef, '#', '') 

    (:  now use the attribute value to lookup a printable name using xml:id in document codes_people.xml :)
    let $persname := normalize-space(string-join($people-collection//tei:person[@xml:id = $nameref]))

return concat($concat1,$persname,'</td>')

These adjustments removed .5 seconds from the query execution time (now 3.5 seconds).

If I strip out the final lookup ($persname) the query executes in .17 seconds. The lookup to the file codes_people.xml appears to be the bottle neck.

EDIT: I've added the following indexes affecting the relevant elements and they have not produced any optimisations

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:tei="http://www.tei-c.org/ns/1.0">
        <range>
           <create qname="tei:persName" type="xs:string"/>
           <create qname="tei:person" type="xs:string"/>
        </range>
    </index>
    <triggers>
        <trigger class="org.exist.extensions.exquery.restxq.impl.RestXqTrigger"/>
    </triggers>
</collection>

View from Query Profiler:

View from Query Profiler

Upvotes: 0

Related Questions