
Issue in index upgrading eXist-db version 2 to 4.5

I am in the process of upgrading and testing a large installation and have hit one issue I cannot understand. I have a large collection of documents for which my index is configured as follows:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">
        <fulltext default="none" attributes="false"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer">
                <param name="stopwords" type="org.apache.lucene.analysis.util.CharArraySet"/>
            </analyzer>
            <analyzer id="ws" class="org.apache.lucene.analysis.WhitespaceAnalyzer"/>
            <text qname="p"/>
            <text qname="li"/>
            <text qname="h1"/>
            <text qname="h2"/>
            <text qname="h3"/>
        </lucene>
    </index>
</collection>

In my version 2 installation this works perfectly. A query returns only elements in the list (p, li, h1, h2, h3), and only those elements that contain the search text (as expected). The search function is:

declare function ls:ls($collection as xs:string, $phrase as xs:string) as element()* {
    for $hit in collection(xmldb:encode-uri($collection))//*[ft:query(.,
        <query>
            <phrase>{$phrase}</phrase>
        </query>
        )]
        order by $hit/ancestor::div[@class='content']/@doc/string()
        return 
            <tr>
                <td>
                    {$hit/ancestor::div[@class='content']/@doc/string()}
                </td>
                <td>
                    {$hit/ancestor::div[@class='content']/@title/string()}
                </td>
                <td>
                    {local-name($hit)}
                </td>
                <td class="hit_text">
                    {normalize-space($hit)}
                </td>
            </tr>
};

Just to see the result, here's a snapshot of the web page results:

Of course this is not showing all the results, but trust me ... it is only returning the named elements and only those with "heart" in them.

After export/import of the content to the new version 4 installation, almost everything else works perfectly. However, even after reindexing the content, the exact same XQuery returns unwanted higher-level elements (like div) and also returns elements which do not contain the search phrase.

For example, this exact same query shows this result:

Now, oddly enough, if I change the function to remove the wildcard and go only after "h1" (or any other of the named elements), it works:

for $hit in collection(xmldb:encode-uri($collection))//h1[ft:query(.,

Yields:

You can see that unlike the previous example, the h1 without "heart" is not returned.

What did I miss in my upgrade? Is there some change to Lucene I missed or do not understand?

Update

As a hack (IMHO), this works:

let $targets := collection(xmldb:encode-uri($collection))//*[local-name(.) = 'p' or local-name(.) = 'h1' or local-name(.) = 'h2' or local-name(.) = 'h3' or local-name(.) = 'li']
    for $hit in $targets[ft:query(.,
        <query>
            <phrase>{$phrase}</phrase>
        </query>
        )]

But if I remove creating the nodeset $targets and put the collection() in the "for" then it does not work.
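The chain of local-name() tests can be written more compactly, because a general comparison (=) against a sequence is true when any item matches. A minimal sketch, assuming the collection path from the log excerpt and the phrase "heart" as example values, and assuming the ft and xmldb prefixes are predeclared as they are in eXist-db. Whether eXist 4.5 optimises this form any differently has not been verified here:

```xquery
xquery version "3.1";

(: Example values only: substitute your own collection path and phrase. :)
let $collection := "/db/EIDO/data/Core"
let $phrase := "heart"

(: Names of the elements that carry a Lucene index in collection.xconf. :)
let $names := ("p", "li", "h1", "h2", "h3")

(: Bind the candidate nodes first, as in the workaround above. :)
let $targets := collection(xmldb:encode-uri($collection))//*[local-name(.) = $names]

for $hit in $targets[ft:query(.,
    <query>
        <phrase>{$phrase}</phrase>
    </query>
    )]
return $hit
```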

Update II

There must be something wrong (as in the full text is not enabled or running or ?) because running a similar query in both takes way longer in the new, updated server.

So what did I miss in the upgrade? I have conf.xml calling out Lucene in both. Any hints for what to look for would be great.

Update III

Maybe this in the logs is a problem? I doubt it as searching the log of the 2.x version shows the same error.

2018-12-19 19:27:05,570 [qtp14962548-143] ERROR (AnalyzerConfig.java [configureAnalyzer]:173) - Lucene index: analyzer class org.apache.lucene.analysis.WhitespaceAnalyzer not found. (org.apache.lucene.analysis.WhitespaceAnalyzer) 
2018-12-19 19:27:38,852 [qtp14962548-43] INFO  (NativeBroker.java [reindexCollection]:1844) - Start indexing collection /db/EIDO/data/Core 
2018-12-19 19:27:54,837 [qtp14962548-43] INFO  (NativeBroker.java [reindexCollection]:1854) - Finished indexing collection /db/EIDO/data/Core in 15985 ms.

Update IV

As suggested, I changed the collection.xconf to remove the stopwords parameter and the WhitespaceAnalyzer:

<collection xmlns="http://exist-db.org/collection-config/1.0">
    <index xmlns:mods="http://www.loc.gov/mods/v3" xmlns:xlink="http://www.w3.org/1999/xlink">
        <fulltext default="none" attributes="false"/>
        <lucene>
            <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
            <text qname="p"/>
            <text qname="li"/>
            <text qname="h1"/>
            <text qname="h2"/>
            <text qname="h3"/>
        </lucene>
    </index>
</collection>

I reindexed the collection. From the log:

2018-12-20 02:14:56,803 [qtp31631875-34] INFO  (NativeBroker.java [reindexCollection]:1844) - Start indexing collection /db/EIDO/data/Core 
2018-12-20 02:15:16,553 [qtp31631875-34] INFO  (NativeBroker.java [reindexCollection]:1854) - Finished indexing collection /db/EIDO/data/Core in 19750 ms.

I get the exact same result.

Update V

I guess I am punting. Going to run the entire process again this weekend, deleting everything and trying again but this makes no sense and does not work.

Update VI

I don't like to punt! Now, in looking at the results, essentially this search in the current installation:

 for $hit in collection(xmldb:encode-uri($collection))//*[ft:query(.,
        <query>
            <phrase>{$phrase}</phrase>
        </query>
        )]

Returns every element in the database, whether they have $phrase or not. It returns the div, then the child p, then maybe the child span. All of them. It does not matter whether the word actually exists in the text.

If I change the wildcard "*" to, say, "h1", it returns only the h1's that actually have that text in them. So something is not right, or broken, or? I certainly can change the element list fed to ft:query to the exact elements in question (p, h1, h2, h3, li), but that query takes forever in 4.5 and a few seconds in 2.

Update Likely last

I gave up and reinstalled everything, including Monex. I re-exported the existing DB and imported it. I only changed the port to 80, although there are other changes I normally make.

Now, even trying to run the dashboard (after import) yields:

javax.servlet.ServletException: javax.servlet.ServletException: An error occurred while processing request to /exist/apps/dashboard/: err:XPST0081 error found while loading module restxq: Error while loading module modules/restxq.xql: Invalid qname text:groups
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146)
    at org.eclipse.jetty.server.handler.gzip.GzipHandler.handle(GzipHandler.java:724)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132)
    at org.eclipse.jetty.server.Server.handle(Server.java:531)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:352)
    at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260)
    at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:281)
    at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:102)
    at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:760)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:678)

Which indicates to me that an export of the database and then reimport will not ever work if you have different apps installed.

Unfortunately I have to punt and look at alternate solutions. I could attempt to just rebuild data or something, but the app had 10,000 users. That cannot be recreated.

At this time, I can only say it is not ready for prime time and will just sit on the old database that works perfectly and has done so for years.

And just to note ... after installation of the fresh, clean database and no changes I can access Monex or dashboard. If I import from my backup (as required because it is not binary compatible) it all breaks.

This is an obvious issue to me for the developers.

Update Again

I did a completely clean install. After that I can access Monex no issues. I then restore my database. NOTE: There is a question at the moment it is finished which is asking if I wish to upgrade the apps. Not sure the right answer, maybe that is one issue and I answer wrong (I answer no).

After all is reinstalled, I can get to the DB fine and my whole application. But when trying to run Monex, I now get:

<exception>
    <path>/db/apps/monex/modules/view.xql</path>
    <message>err:XPST0081 error found while loading module indexes: Error while loading module indexes.xqm: Invalid qname text:index-terms</message> 
</exception>

Is the proper answer yes to upgrade the apps? I assume what this means is that the Monex I installed with just a pure installation is overwritten by my version 2 backup and this is causing an error.

I hacked out the part of monex's index causing an issue and got Monex to run. So it is using Lucene:

So, one observation is the issue as to why Monex runs fine but restore my (old) DB kills it. It should not AFAIK.

maybe someone can explain this result to me, I do not understand the second item but I suspect that it is the one returning everything:

OK, Working

So. First, I figured out that restoring my /db will ruin all the /apps (like Monex) in a fresh install. Seems strange to me, or bad planning on my part or others'. So to work around this issue, I made a backup of the fresh install.

After I install the new version of eXist, I restore my old database and then immediately restore the fresh-install backup on top of it. This overwrites all the /apps (like Monex) that came from my old backup with the latest versions, but does not ruin my own application. Sorry, ridiculous.

Now after that I could test and see that the Lucene index is being used. But that is all it told me, nothing else (as I suspected).

It is obvious that the behavior has changed in the Lucene integration. In my old version, I would send every element and it would only return the hits. In this new version, you cannot do that. If you send something like is done in the code above, it will still return it as a "hit" even if there is none. Therefore, the $collection//* sends the entire structure to the query and it returns everything, whether there is a hit or not. It did not behave this way before.

So the solution is (which is such a hack I hate to even say it) that you can only send to the query the items you actually want searched for a hit. WOW. Again, I am sorry, and if I am wrong please show me, but that is a total hack. If I create an index of all the p's, I only expect the p's back if I do a general search sending it p's, h1's, etc. What happens now is it sends everything back, hit or not, unless you ask for the exact same name of element that you indexed.

It seems like a late/early binding thing. In the old eXist I would send $coll/*[ft:query... and it returned only the elements identified in my index. Now it does not work that way, so you cannot execute the for loop across $coll/*[ft:query... as it still returns everything. IMHO that is wrong.

So to solve, I did this, basically execute the search first and then iterate over the results.

declare function ls:ls($collection as xs:string, $phrase as xs:string) as element()* {
    let $coll := collection(xmldb:encode-uri($collection))
    let $hits := ($coll//p | $coll//li | $coll//h1 | $coll//h2 | $coll//h3)[ft:query(.,
        <query>
            <phrase>{$phrase}</phrase>
        </query>
        )]
    for $hit in $hits
        order by $hit/ancestor::div[@class='content']/@doc/string()
        return 
            <tr>
                <td>
                    {$hit/ancestor::div[@class='content']/@doc/string()}
                </td>
                <td>
                    {$hit/ancestor::div[@class='content']/@title/string()}
                </td>
                <td>
                    {local-name($hit)}
                </td>
                <td class="hit_text">
                    {normalize-space($hit)}
                </td>
            </tr>
};

And now I updated to test and this works also:

let $hits := (collection(xmldb:encode-uri($collection))//*)[ft:query(.,
    <query>
        <phrase>{$phrase}</phrase>
    </query>
    )]
for $hit in $hits ...

So this is now so close to what I had before, I do NOT need to go after the explicit elements which is correct. The issue is that now they cannot be on the for loop.

The key is here:

(collection(xmldb:encode-uri($collection))//*)

versus:

collection(xmldb:encode-uri($collection))//*

And so ... all of that ... and the solution is that the for loop needs to be:

for $hit in (collection(xmldb:encode-uri($collection))//*)[ft:query(.,
    <query>
        <phrase>{$phrase}</phrase>
    </query>
    )]

Since this is now solved, maybe someone would like to explain why the old code which did not use () around the individual elements worked but does not in the latest eXist.

Just to be exact, I have both systems open for testing.

Version 2x:

for $hit in collection(xmldb:encode-uri($collection))//*[ft:query(.,

One second, correct answer.

for $hit in (collection(xmldb:encode-uri($collection))//*)[ft:query(.,

17 seconds, correct answer.

Version 4.5:

for $hit in collection(xmldb:encode-uri($collection))//*[ft:query(.,

10 seconds, completely wrong answer (div's and non-hits returned)

for $hit in (collection(xmldb:encode-uri($collection))//*)[ft:query(.,

one second, right answer.

It looks to me that in old eXist, a query on an element with no index returned nothing, whereas this new eXist seems to return a result for every element sent, even if no index exists for it.
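For anyone wanting to reproduce the comparison on their own data, the two forms can be run side by side in one query. This is only a sketch: the collection path is taken from the log excerpt above, the phrase "heart" from the earlier example, and the ft and xmldb prefixes are assumed to be predeclared as they are in eXist-db. Per the XQuery specification a non-positional predicate should select the same nodes in both forms, so any difference in the counts points at the optimizer rather than at the query:

```xquery
xquery version "3.1";

(: Example values only: substitute your own collection path and phrase. :)
let $collection := "/db/EIDO/data/Core"
let $phrase := "heart"

let $coll := collection(xmldb:encode-uri($collection))
let $query :=
    <query>
        <phrase>{$phrase}</phrase>
    </query>

(: Form 1: predicate attached directly to the last path step. :)
let $plain := $coll//*[ft:query(., $query)]

(: Form 2: path evaluated first, predicate applied to the parenthesised result. :)
let $wrapped := ($coll//*)[ft:query(., $query)]

return
    <comparison>
        <plain count="{count($plain)}"
               elements="{distinct-values($plain ! local-name(.))}"/>
        <wrapped count="{count($wrapped)}"
                 elements="{distinct-values($wrapped ! local-name(.))}"/>
    </comparison>
```

If the elements attribute of the first form lists names such as div that have no Lucene index, the behaviour described above is being reproduced.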

One last update

In looking through the clean install conf.xml, I found a comment in the xquery entry for enable-query-rewriting. This comment suggests that it is experimental and setting to "yes" could lead to incorrect results.

I would note that I do not believe I touched this, and a default installation has this value set to "yes". I saved conf.xml from the clean install because I change many things in it (of course). Looking at the clean installation, I see this:

<xquery enable-java-binding="no" disable-deprecated-functions="no" 
        enable-query-rewriting="yes" backwardCompatible="no" 
        enforce-index-use="always"
        raise-error-on-failed-retrieval="no">

I changed it to "no" and restarted eXist-db. Now everything works as it did before: the search returns exactly what I expect, with the query written exactly as it was in version 2.x.

So ... what I believe I learned

I implemented the new range indexes and reindexed the collection based on the comments below and re-enabled the query rewriting. Checking monex I see the indexes but my queries did not use them, it reported index as the legacy "range" and optimization as "No Index".

I found that I cannot do this (which the wildcard would be doing I assume):

($collection//foo | $collection//bar)[contains(.,$phrase)]

or this

($collection//foo , $collection//bar)[contains(.,$phrase)]

or this

let $testnodes := $collection//foo | $collection//bar

then

$testnodes[contains(.,$phrase)]

While it works, it does not use the new-range index. These would always report no index used.

But this does use full optimized, new-range indexes:

$collection//foo[contains(.,$phrase)] | $collection//bar[contains(.,$phrase)]
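The same per-element pattern can be applied to the Lucene search from the original question. A minimal sketch, not something tested in this thread: local:hits is a hypothetical helper name, the collection path and phrase are example values from earlier in the post, and the ft and xmldb prefixes are assumed to be predeclared as in eXist-db. Each indexed element name gets its own path with its own ft:query predicate, and the results are combined with a union:

```xquery
xquery version "3.1";

(: Hypothetical helper: one predicate per indexed element name, then a union. :)
declare function local:hits($collection as xs:string, $phrase as xs:string) as element()* {
    let $coll := collection(xmldb:encode-uri($collection))
    let $query :=
        <query>
            <phrase>{$phrase}</phrase>
        </query>
    return
        $coll//p[ft:query(., $query)]
        | $coll//li[ft:query(., $query)]
        | $coll//h1[ft:query(., $query)]
        | $coll//h2[ft:query(., $query)]
        | $coll//h3[ft:query(., $query)]
};

(: Example call with the collection path from the log and the phrase used earlier. :)
count(local:hits("/db/EIDO/data/Core", "heart"))
```

The union returns the hits in document order without duplicates, so the result can be fed to the existing for loop unchanged.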

Upvotes: 1

Answers (3)

adamretter

Reputation: 3517

We should clear up the errors first...

The class for the Whitespace Analyzer should be org.apache.lucene.analysis.core.WhitespaceAnalyzer.

Although it doesn't look like you reference the whitespace analyzer by its 'id' so, you could just remove it.

The config for your use of the StandardAnalyzer looks wrong to me. You have specified a stopwords parameter, but:
1. its class is wrong, it should be org.apache.lucene.analysis.util.CharArraySet, and
2. you have not given it any value(s).

If you just want the default stop words, you can omit the parameter entirely.

Once you have made those changes, you should try reindexing and monitor the logs again.

After that you should use the Monex app from the Dashboard in eXist 4.5.0 to examine the available indexes, to check that your data was indexed as you expected.
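Reindexing can also be triggered from a query rather than from a client. A minimal sketch, assuming it is run as a user in the dba group and that the collection path is the one shown in the log excerpt in the question; xmldb:reindex returns a boolean indicating success:

```xquery
xquery version "3.1";

(: Assumption: executed as a dba user; the path is taken from the question's log. :)
let $target := "/db/EIDO/data/Core"

(: Rebuild all configured indexes for the collection and its sub-collections. :)
let $ok := xmldb:reindex($target)

return
    <reindex collection="{$target}" success="{$ok}"/>
```

The collection.xconf must already be stored under /db/system/config, mirroring the path of the data collection, before the reindex is run.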

Update 1

From the comment of @kevin-brown:

From what I see today, if I do this ($collection//foo | $collection//bar)[fn:contains(.,'string')] no index is used. But if I do this $collection//foo[fn:contains(.,'string')] | $collection//bar[fn:contains(.,'string')], the new-range index is used and optimization is full.

I can confirm that in certain formulations of the XQuery, eXist-db is not correctly optimising the query to make use of the range index. This is certainly a bug!

The Java Admin Client of eXist-db allows you to show a trace of the query:

(1) ($collection//foo | $collection//bar)[fn:contains(., $string)], which Kevin reported did not use the index, produces the trace:
```
$collection/descendant::{}foo union
    $collection/descendant::{}bar
        [contains(self::node(), $string)]
```

(2) $collection//foo[fn:contains(., $string)] | $collection//bar[fn:contains(., $string)], which Kevin reported did correctly use the index, produces the trace:

$collection
(# exist:optimize-field #)
(# exist:optimize #) {
    descendant::{}foo[range:contains(self::node(), $string)]
}
union $collection
(# exist:optimize-field #)
(# exist:optimize #) {
    descendant::{}bar[range:contains(self::node(), $string)]
}

In (2) we can clearly see that optimizations are indicated by XQuery pragmas. These mean that a suitable index was detected and will be used during evaluation.

By comparison, in (1) we see that eXist failed to correctly detect the available indexes that could allow for an optimisation.

Sadly, it also seems that eXist-db might have used the wrong axis for these, i.e. descendant rather than descendant-or-self.

I have opened a GitHub issue for eXist-db which reports this problem - https://github.com/eXist-db/exist/issues/2363

Upvotes: 1

duncdrum

Reputation: 733

eXist-db 2.2 was released in 2014, so long-jump upgrades across two major versions have a tendency not to be straightforward.

It looks like your code is still using the legacy range index, as reported by Monex, which is the likely cause of your unwanted results.

This index is marked as deprecated with the new range index to be used instead.

If you can't provide a MWE, you need to figure out which of your queries call the old range index and change them to the new, or disable the old-range index entirely.

I would not recommend running, e.g., an old Monex inside a new eXist; I would say yes when asked whether to upgrade the default apps to the newer versions. You can still run a production site without any default apps.

It's not possible to tell from your examples how for $hit in (collection(xmldb:encode-uri($collection))//*)[ft:query(., side-steps invocations of the old range index in your app, but it should give you a clue. My guess is that if you get rid of those invocations, you'll see for $hit in collection(xmldb:encode-uri($collection))//*[ft:query(., act and work in the same way.

Upvotes: 1

jbrehr

Reputation: 815

Although I'm still new to eXist, it seems to me there are two ideas being conflated.

Telling Lucene to index something is not the same as putting a predicate on a query XPath. The qname for a Lucene index doesn't (I believe) mean a given element won't be subject to query. It's just a question of what is indexed by Lucene in order to speed up searches. The fact that you found a speed improvement by using a predicate suggests this is true.

When I do my searches, I still restrict the elements subject to query regardless of what I tell Lucene to index. I don't personally see that as a hack - just reducing the 'search pool'. I don't use local-name() as predicate. Rather, I would use the element itself. I'm not sure if there is a cost to using local-name() versus this:

let $coll := collection(xmldb:encode-uri($collection))

let $target := $coll//p | $coll//h1 | $coll//h2 | $coll//h3 | $coll//li

Depending on your XML hierarchy, you might find even more speed by reducing the pool of nodes with collection(xmldb:encode-uri($collection))//some-element

The above might then use the Lucene indexes more efficiently? It's worth testing.

Furthermore, although I don't know what the hierarchy of your XML is, you can also explicitly tell Lucene to ignore certain elements (but this is usually for those elements nested inside those which you are indexing):

 <ignore qname="div"/>

NB: I use eXist 4.4

Added: try using a range index in addition to Lucene. Also, I don't see a namespace in the qnames (plus you have two namespaces declared, and I've added a third, xmlns:xs, for the range index).

This example assumes (copied from eXist documentation linked above) a namespace of mods for demonstration. But it must be appended to each qname if there is a specific namespace in the xml collections.

<collection xmlns="http://exist-db.org/collection-config/1.0">
  <index xmlns:mods="http://www.loc.gov/mods/v3" 
        xmlns:xlink="http://www.w3.org/1999/xlink"  
        xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <fulltext default="none" attributes="false"/>
    <range>
       <create qname="mods:p" type="xs:string"/>
       <create qname="mods:li" type="xs:string"/>
       <create qname="mods:h1" type="xs:string"/>
       <create qname="mods:h2" type="xs:string"/>
       <create qname="mods:h3" type="xs:string"/>
    </range>
    <lucene>
        <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
        <text qname="mods:p"/>
        <text qname="mods:li"/>
        <text qname="mods:h1"/>
        <text qname="mods:h2"/>
        <text qname="mods:h3"/>
        <ignore qname="mods:div"/>
    </lucene>
  </index>
</collection>

Remove namespace declarations that aren't used.

Upvotes: 1
