wildcard cts:element-value-query returning wrong matches

A wildcarded cts:element-value-query is not behaving as I expect.

Query to insert the document:

xdmp:document-insert('/sample/2.xml', <data>the living Theater</data>)

cts query:

cts:search(
    doc(),
    cts:element-value-query(xs:QName('data'), 'theater* *', ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')),
    'unfiltered'
)

The cts query above returns the /sample/2.xml document. As I understand it, this query should not return that document; it should return only documents whose data value starts with the text "theater".

The issue seems to occur with the following text pattern.

Text present in the document: @@@ word @@@text

Search term: @@@t* *

Here @ can be any character.

I can reproduce the problem with the following data as well.

Text present in the document: mark the marklogic

Search term: markl* *

The wildcard-related indexes are all enabled.

I have pasted the database configuration below; it might help in finding the problem.

Database configuration:

<package-database xmlns="http://marklogic.com/manage/package/databases">
    <config>
        <name>publishers</name>
        <package-database-properties>
            <enabled>true</enabled>
            <retired-forest-count>0</retired-forest-count>
            <language>en</language>
            <stemmed-searches>advanced</stemmed-searches>
            <word-searches>true</word-searches>
            <word-positions>true</word-positions>
            <fast-phrase-searches>true</fast-phrase-searches>
            <fast-reverse-searches>false</fast-reverse-searches>
            <triple-index>true</triple-index>
            <triple-positions>true</triple-positions>
            <fast-case-sensitive-searches>true</fast-case-sensitive-searches>
            <fast-diacritic-sensitive-searches>true</fast-diacritic-sensitive-searches>
            <fast-element-word-searches>true</fast-element-word-searches>
            <element-word-positions>true</element-word-positions>
            <fast-element-phrase-searches>true</fast-element-phrase-searches>
            <element-value-positions>true</element-value-positions>
            <attribute-value-positions>true</attribute-value-positions>
            <field-value-searches>true</field-value-searches>
            <field-value-positions>true</field-value-positions>
            <three-character-searches>true</three-character-searches>
            <three-character-word-positions>true</three-character-word-positions>
            <fast-element-character-searches>true</fast-element-character-searches>
            <trailing-wildcard-searches>true</trailing-wildcard-searches>
            <trailing-wildcard-word-positions>true</trailing-wildcard-word-positions>
            <fast-element-trailing-wildcard-searches>true</fast-element-trailing-wildcard-searches>
            <word-lexicons>
                <word-lexicon>http://marklogic.com/collation/codepoint</word-lexicon>
            </word-lexicons>
            <two-character-searches>false</two-character-searches>
            <one-character-searches>false</one-character-searches>
            <uri-lexicon>true</uri-lexicon>
            <collection-lexicon>true</collection-lexicon>
            <reindexer-enable>true</reindexer-enable>
            <reindexer-throttle>5</reindexer-throttle>
            <reindexer-timestamp>0</reindexer-timestamp>
            <directory-creation>manual</directory-creation>
            <maintain-last-modified>false</maintain-last-modified>
            <maintain-directory-last-modified>false</maintain-directory-last-modified>
            <inherit-permissions>false</inherit-permissions>
            <inherit-collections>false</inherit-collections>
            <inherit-quality>false</inherit-quality>
            <in-memory-limit>174080</in-memory-limit>
            <in-memory-list-size>341</in-memory-list-size>
            <in-memory-tree-size>85</in-memory-tree-size>
            <in-memory-range-index-size>11</in-memory-range-index-size>
            <in-memory-reverse-index-size>11</in-memory-reverse-index-size>
            <in-memory-triple-index-size>44</in-memory-triple-index-size>
            <large-size-threshold>1024</large-size-threshold>
            <locking>fast</locking>
            <journaling>fast</journaling>
            <journal-size>682</journal-size>
            <journal-count>2</journal-count>
            <preallocate-journals>false</preallocate-journals>
            <preload-mapped-data>false</preload-mapped-data>
            <preload-replica-mapped-data>false</preload-replica-mapped-data>
            <range-index-optimize>facet-time</range-index-optimize>
            <positions-list-max-size>256</positions-list-max-size>
            <format-compatibility>automatic</format-compatibility>
            <index-detection>automatic</index-detection>
            <expunge-locks>none</expunge-locks>
            <tf-normalization>scaled-log</tf-normalization>
            <merge-priority>lower</merge-priority>
            <merge-max-size>32768</merge-max-size>
            <merge-min-size>1024</merge-min-size>
            <merge-min-ratio>2</merge-min-ratio>
            <merge-timestamp>0</merge-timestamp>
            <retain-until-backup>false</retain-until-backup>
            <assignment-policy-name>bucket</assignment-policy-name>
        </package-database-properties>
    </config>
</package-database>

Upvotes: 2

Answers (2)

Wagner Michael

Reputation: 2192

Running an unfiltered search comes with some caveats:

They determine the results directly from the indexes, without filtering for validation. This makes unfiltered results most comparable to traditional search-engine style results.

They include false-positive results. False-positive results can originate from a number of situations, including phrase searches containing 3 or more words, certain wildcard searches, punctuation-sensitive, diacritic-sensitive, and/or case-sensitive searches.

MarkLogic provides a way to determine whether a result is a false positive: you can use cts:contains. The following XQuery shows that your result is indeed a false positive:

xquery version "1.0-ml";

declare boundary-space preserve;
declare namespace qm="http://marklogic.com/xdmp/query-meters";

let $trueCounter := 0
let $falseCounter := 0
let $query := cts:element-value-query(xs:QName('data'), 'theater* *')
let $x := 
  for $result in cts:search(fn:doc(), $query, "unfiltered")
  return
  (
  if ( cts:contains($result, $query) )
  then ( xdmp:set($trueCounter, $trueCounter + 1) )
  else ( xdmp:set($falseCounter, $falseCounter + 1) )
  )
return
<results>
  <resultTotal>{$trueCounter}</resultTotal>
  <false-positiveTotal>{$falseCounter}</false-positiveTotal>
  <elapsed-time>{xdmp:query-meters()/qm:elapsed-time/text()}
  </elapsed-time>
</results>

MarkLogic searches are split into two steps:

1. Candidate ID resolution. MarkLogic looks up matching documents in the indexes. These are only candidates, meaning they might be false positives. This step narrows down the set of documents so that MarkLogic does not have to load too many fragments.
2. Filtering. The candidate IDs are used to load fragments from disk. Each fragment is then tested again against the original query. This step removes the false positives.

With an unfiltered search, the second step is skipped, so false positives can remain in the results. You can read more about that here.
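One way to see the gap between the two steps is to compare the index-only estimate with the filtered count for the same query. This is only a minimal sketch, assuming the sample document from the question is loaded; the wrapper element names (counts, index-candidates, filtered-matches) are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Same query and options as in the question :)
let $query := cts:element-value-query(
  xs:QName('data'),
  'theater* *',
  ('wildcarded', 'case-insensitive', 'unstemmed',
   'punctuation-sensitive', 'whitespace-sensitive')
)
return
  <counts>
    {
      (: Step 1 only: number of candidate fragments resolved from the indexes :)
      <index-candidates>{xdmp:estimate(cts:search(fn:doc(), $query))}</index-candidates>,
      (: Steps 1 and 2: candidates are loaded and re-checked against the query :)
      <filtered-matches>{fn:count(cts:search(fn:doc(), $query, 'filtered'))}</filtered-matches>
    }
  </counts>
```

If the first number is larger than the second, the difference is the number of candidates that the filtering step rejected as false positives.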

Edit: This section further describes applications which can use unfiltered searches:

Your content and search terms are such that you know the unfiltered searches are also accurate (for example, the searches are all performed at document or fragment roots, they are single-term queries, and are not wildcard, punctuation-sensitive, diacritic-sensitive, and/or capitalization-sensitive searches).

You do not mind if there are some false-positive results because the results are an estimate (that is, they need to be fast, but are not required to be exact).

Your searches return a large number of results and you want efficient ways to jump to a particular portion of those results.

As the first item states, you cannot rely on unfiltered wildcard queries if you do not want false positives. I guess you should stick to filtered searches then.
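For reference, the filtered form of the query from the question would look roughly like this. It is a sketch rather than a tested fix: the only change is the last argument, and since filtered is the default for cts:search, leaving the option out should have the same effect.

```xquery
xquery version "1.0-ml";

(: Same query as in the question, but with filtering enabled so that
   index candidates are re-checked against the actual fragment content :)
cts:search(
  fn:doc(),
  cts:element-value-query(
    xs:QName('data'),
    'theater* *',
    ('wildcarded', 'case-insensitive', 'unstemmed',
     'punctuation-sensitive', 'whitespace-sensitive')
  ),
  'filtered'
)
```

The trade-off is that every candidate fragment has to be loaded and checked, so this is slower than the unfiltered search when the indexes return many candidates.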

Hope this helps!

Upvotes: 2

Pragya Kapoor

Reputation: 121

Try creating an element range index on the data element and then run the search below:

let $terms :=  cts:element-value-match(xs:QName("data"),"theater* *")
return
  cts:search(
    doc(),
    cts:element-value-query(
      xs:QName('data'), 
      $terms, 
      ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')
    ),
    'unfiltered'
  )

This will not return the '/sample/2.xml' document.
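The range index can be created in the Admin UI, or scripted with the Admin API along these lines. This is a sketch under assumptions: the database name "publishers" is taken from the configuration in the question, while the string type, the empty namespace, and the default collation are guesses about how the data element should be indexed.

```xquery
xquery version "1.0-ml";

import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

let $config := admin:get-configuration()
(: "publishers" is the database name from the question's configuration :)
let $db-id := xdmp:database("publishers")
(: Assumed index settings: string values, no namespace, default collation,
   range value positions off :)
let $index := admin:database-range-element-index(
  "string", "", "data", "http://marklogic.com/collation/", fn:false())
let $config := admin:database-add-range-element-index($config, $db-id, $index)
return admin:save-configuration($config)
```

After the configuration is saved, the database needs to finish reindexing before cts:element-value-match can read values from the new index.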

Upvotes: 1
