XQuery Full Text search with Word1 and NOT Word2

Question

The XML structure is as follows:


  
<Docs>
  <Doc>
    <Name>Doc 1</Name>
    <Notes>
      <specialNote>
        This is a special note section.
        <B>This B Tag is used for highlighting any text and is optional</B>
        <U>This U Tag will underline any text and is optional</U>
        <I>This I Tag is used for highlighting any text and is optional</I>
      </specialNote>
      <generalNote>
        <P>
          This will store the general notes and might have number of paragraphs. This is para no 1. NO Child Tags here
        </P>
        <P>
          This is para no 2
        </P>
      </generalNote>
    </Notes>
    <Desc>
      <P>
        This is used for Description and might have number of paragraphs. Here too, there will be B, U and I Tags for highlighting the description text and are optional
        <B>Bold</B>
        <I>Italic</I>
        <U>Underline</U>
      </P>
      <P>
        This is description para no 2 with I and U Tags
        <I>Italic</I>
        <U>Underline</U>
      </P>
    </Desc>
  </Doc>
</Docs>

There will be thousands of Doc tags. I want to offer the user a search criterion of the form WORD1 and NOT WORD2. This is the query:

for $x in doc('Documents')/Docs/Doc[
  Notes/specialNote/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/B/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/I/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/specialNote/U/text() contains text 'Tom' ftand ftnot 'jerry' or
  Notes/generalNote/P/text() contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/text() contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/B/text() contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/I/text() contains text 'Tom' ftand ftnot 'jerry' or
  Desc/P/U/text() contains text 'Tom' ftand ftnot 'jerry'
]
return $x/Name

The result of this query is wrong: it includes some docs that contain both Tom and jerry. So I changed the query to:

for $x in doc('Documents')/Docs/Doc[. contains text 'Tom' ftand ftnot 'jerry'] 
return $x/Name

This query gives me exactly the right result, i.e. only those docs with Tom and not jerry, but it takes a huge amount of time: approx. 45 secs, whereas the earlier one took 10 secs!

I am using BaseX 7.5 XML Database.

Need expert comments on this :)

Leo Wörteler · Accepted Answer

The first query tests each text node in the document separately, so

<P><B>Tom</B> and <I>Jerry</I></P>

would match because the first text node contains Tom but not Jerry.
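The difference between the two behaviours can be checked on a small inline fragment. This is only a sketch: the fragment and the variable name $p are made up for illustration, and it assumes a processor with XQuery Full Text support such as BaseX.

```xquery
(: Hypothetical fragment: 'Tom' and 'Jerry' sit in different text nodes. :)
let $p := <P><B>Tom</B> and <I>Jerry</I></P>
return (
  (: Per text node: the text node inside B contains 'Tom' and no 'Jerry',
     so at least one text node satisfies the whole selection. :)
  exists($p//text()[. contains text 'Tom' ftand ftnot 'Jerry']),

  (: Whole element: the string value of P is "Tom and Jerry",
     so the ftnot 'Jerry' part fails. :)
  $p contains text 'Tom' ftand ftnot 'Jerry'
)
```

The first item should come out as true and the second as false, which mirrors the wrong hits of the first query and the correct result of the second.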

In the second query, the full-text search is performed on all the text contents of the Doc elements as if they were concatenated into one string. This cannot (currently) be answered by BaseX's full-text index, which indexes each text node separately.

A solution is to perform the full-text search for each term separately and merge the results at the end. Each search then works on individual text nodes, so the index can be used:

for $x in (doc('Documents')/Docs/Doc[.//text() contains text 'Tom']
            except doc('Documents')/Docs/Doc[.//text() contains text 'Jerry'])
return $x/Name

The above query is rewritten by the query optimizer to this equivalent one using two index accesses:

for $x in (db:fulltext("Documents", "Tom")/ancestor::*:Doc
            except db:fulltext("Documents", "Jerry")/ancestor::*:Doc)
return $x/Name

You can even tweak the order in which you merge the results to keep intermediate results small if you want.
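Since the question asks for user-supplied words, the except approach can be sketched with the two terms as variables. Everything here is illustrative: the variable names $include and $exclude, their default values, and the inline $docs sample are assumptions. Whether BaseX still rewrites the query to index accesses when the terms come from variables is not confirmed here.

```xquery
(: Hypothetical external variables for the two search words. :)
declare variable $include as xs:string external := 'Tom';
declare variable $exclude as xs:string external := 'Jerry';

(: Inline sample data; replace $docs with doc('Documents')/Docs
   to run against the real database. :)
let $docs :=
  <Docs>
    <Doc><Name>Doc 1</Name><Desc><P><B>Tom</B> and <I>Jerry</I></P></Desc></Doc>
    <Doc><Name>Doc 2</Name><Desc><P>Only <B>Tom</B> here</P></Desc></Doc>
  </Docs>
(: Docs with a text node matching $include, minus docs with one matching $exclude. :)
for $x in ($docs/Doc[.//text() contains text { $include }]
           except $docs/Doc[.//text() contains text { $exclude }])
return $x/Name
```

With the sample data, only the Name of Doc 2 should be returned, because Doc 1 contains both words, although in different text nodes.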
