Kevin Brown

Reputation: 8877

Is there a more efficient XQuery here for deletion?

I have a large eXist-db database in which I am implementing some functions to manage the data inside.

The pertinent information here is:

There are 2000 customers.
Each customer has access to 400 documents.
Now each document has 20 languages.

So I have 2000 XML files, each of which has something like this:

<customer name="foo">
    <document num="A01" subscribed="Y">
      <languages>
        <lang subscribed="N">Arabic</lang>
        <lang subscribed="Y">Polish</lang>
        <!-- ... and so on for 20 languages ... -->
      </languages>
    </document>
    <!-- ... and so on for 400 documents ... -->
</customer>

Now I am trying to write an XQuery because I need to revoke a language. This means I essentially need to delete, in all 2000 files and for every document, the <lang> whose name is given.

So something like this:

 for $langs in $g:collection.customers//lang[text()=$deletelang]
 return update delete $langs

But this is taking forever and requires a huge amount of memory. Just looking at it, $langs would be 2000 * 400 = 800,000 items.

I would note that a range index exists which contains:

<create qname="lang" type="xs:string" nested="no"/>

But still, is that delete query so inefficient that it essentially cannot execute over 800,000 items, or is there some other way it should be written?

Update I

So I changed things a bit, just to see.

1) Per the comment, I changed text() to "."

2) I added subsequence() around the expression to test various sizes:

for $langs in subsequence($g:collection.customers//lang[. = $deletelang], 1, 30000)
return update delete $langs

1-30000 = 24s

So running the whole collection now takes 110s.

Upvotes: 2

Views: 36

Answers (2)

westbaystars

Reputation: 151

It's been a while since I've had to do mass deletions. But at least in eXist version 2.2 I could do:

let $langs := $g:collection.customers//lang[.=$deletelang]
return update delete $langs

There was no need to loop through them all.

On a separate note, I would do a complete backup before issuing such mass deletions, just to make sure that everything goes right.

Hope this helps.

Upvotes: 0

wst

Reputation: 11771

You would need to profile the query to determine exactly where it's spending the most time, but it's possible that your use of text() is preventing the engine from using the range index, which would force it to load all of those documents into memory.

text() selects text nodes, and an element may contain more than one text node. For example, <lang>Pol<!-- comment -->ish</lang> looks like it just contains Polish, but its content is split across two text nodes ("Pol" and "ish"), so it will fail a predicate that assumes there is only one: [text() = 'Polish'].
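A quick way to see the difference (a throwaway element just for illustration, not from your data):

let $l := <lang subscribed="Y">Pol<!-- comment -->ish</lang>
return (
    string($l),                      (: "Polish" - the full string value :)
    count($l/text()),                (: 2 - the content is split into two text nodes :)
    exists($l[text() = 'Polish']),   (: false - neither "Pol" nor "ish" equals "Polish" :)
    exists($l[. = 'Polish'])         (: true - compares the element's whole string value :)
)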

Try using dot in your predicate: [. = $deletelang].

It's also possible that there is additional overhead in the database from committing such a large update at once. If it is not critical that the delete be done atomically, you may improve performance by batching the updates.
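If you go that route, here is a rough sketch of what batched deletes could look like, reusing $g:collection.customers and $deletelang from your question (the batch size is arbitrary, and whether each batch is actually committed separately depends on your eXist version and transaction handling):

let $langs := $g:collection.customers//lang[. = $deletelang]
let $batch-size := 10000
for $i in 1 to xs:integer(ceiling(count($langs) div $batch-size))
return
    (: delete one fixed-size slice of the matching <lang> nodes at a time :)
    update delete subsequence($langs, ($i - 1) * $batch-size + 1, $batch-size)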

Upvotes: 1
