Field index only updated after merge

Question

I have created a simple field index which looks like this:

field name: root_test
include root: false
word lexicons: http://marklogic.com/collation/de/S1
index settings: only word searches enabled
Included Elements: an element named content

I create a document containing an element content with two child elements, header and body. A second request then reads all words from the field lexicon and tests whether they include the word Body. As expected, they do. I then update the document so that it no longer has the body element and request the field words again. The field lexicon still contains the word Body. This is my test script:

xquery version "1.0-ml";

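(: <root> and <other> are assumed names; only content, header and body are given in the description :)
xdmp:document-insert("test.xml",
  <root>
    <other>not found</other>
    <content>
      <header>Found</header>
      <body>Body</body>
    </content>
  </root>
);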
fn:exists(fn:index-of(
  cts:field-words("root_test", (), ("collation=http://marklogic.com/collation/de/S1")), 
  "Body"
)) = fn:true();

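(: same document without the body element; <root> and <other> are assumed names :)
xdmp:document-insert("test.xml",
  <root>
    <other>not found</other>
    <content>
      <header>Found</header>
    </content>
  </root>
);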
fn:empty(fn:index-of(
  cts:field-words("root_test", (), ("collation=http://marklogic.com/collation/de/S1")),
  "Body"
)) = fn:true()

I expected the following output:

true
true

But what I actually get is:

true
false

The word Body is only removed from the field lexicon if I run a manual merge after the update (the second insert).

Am I doing something wrong here? I am using MarkLogic 9.0-8.

mholstege · Accepted Answer

The word lexicon doesn't keep track of specific document instances (doing so would be prohibitively expensive), so it cannot purge information about deleted words until after a merge. Word lexicons are meant for query suggestion and to assist certain wildcard queries; you shouldn't count on them to provide precise information about the presence or absence of specific words in the corpus.

If you want to know whether a specific word is in the corpus, do an estimate of a word query, e.g. xdmp:estimate(cts:search(doc(),cts:word-query("Body",("unstemmed","case-insensitive","diacritic-insensitive")))). That won't give quite the same equality constraints as your collation, however, because search is codepoint based and doesn't fold compatibility characters and the like.
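
As a rough sketch of how the second check in the test script could be rewritten along these lines, the following uses the same word query as above and turns the estimate into a boolean. It searches the whole document rather than only the root_test field, and it has not been run against the setup from the question, so treat it as a starting point rather than a verified fix.

```xquery
xquery version "1.0-ml";

(: Ask the search indexes, not the word lexicon, whether any
   live document still contains the word "Body". :)
let $query := cts:word-query(
  "Body",
  ("unstemmed", "case-insensitive", "diacritic-insensitive"))
(: xdmp:estimate counts matching fragments from the indexes
   without fetching the documents themselves. :)
let $count := xdmp:estimate(cts:search(fn:doc(), $query))
return $count eq 0
```

If the check should only consider the content element covered by the field, swapping in cts:field-word-query("root_test", "Body", ...) with the same options is probably the closest equivalent, assuming word searches are enabled on the field as described in the question.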
