Reputation: 2620
A wildcard cts:element-value-query is not behaving as expected.
Query used to insert the document:
xdmp:document-insert('/sample/2.xml', <data>the living Theater</data>)
cts query:
cts:search(
  doc(),
  cts:element-value-query(
    xs:QName('data'),
    'theater* *',
    ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')
  ),
  'unfiltered'
)
The cts query above returns the /sample/2.xml document. As I understand it, this query should not return that document; it should return only documents whose text starts with "theater".
The issue seems to be with the following text pattern:
text present in the document: @@@ word @@@text
search term: @@@t* *
where @ can be any character.
I can reproduce the problem with the following data as well.
present text in doc: mark the marklogic
search text: markl* *
The wildcard-related indexes are set to true.
I have pasted the database configuration below in case it helps in finding the problem.
Database configuration:
<package-database xmlns="http://marklogic.com/manage/package/databases">
<config>
<name>publishers</name>
<package-database-properties>
<enabled>true</enabled>
<retired-forest-count>0</retired-forest-count>
<language>en</language>
<stemmed-searches>advanced</stemmed-searches>
<word-searches>true</word-searches>
<word-positions>true</word-positions>
<fast-phrase-searches>true</fast-phrase-searches>
<fast-reverse-searches>false</fast-reverse-searches>
<triple-index>true</triple-index>
<triple-positions>true</triple-positions>
<fast-case-sensitive-searches>true</fast-case-sensitive-searches>
<fast-diacritic-sensitive-searches>true</fast-diacritic-sensitive-searches>
<fast-element-word-searches>true</fast-element-word-searches>
<element-word-positions>true</element-word-positions>
<fast-element-phrase-searches>true</fast-element-phrase-searches>
<element-value-positions>true</element-value-positions>
<attribute-value-positions>true</attribute-value-positions>
<field-value-searches>true</field-value-searches>
<field-value-positions>true</field-value-positions>
<three-character-searches>true</three-character-searches>
<three-character-word-positions>true</three-character-word-positions>
<fast-element-character-searches>true</fast-element-character-searches>
<trailing-wildcard-searches>true</trailing-wildcard-searches>
<trailing-wildcard-word-positions>true</trailing-wildcard-word-positions>
<fast-element-trailing-wildcard-searches>true</fast-element-trailing-wildcard-searches>
<word-lexicons>
<word-lexicon>http://marklogic.com/collation/codepoint</word-lexicon>
</word-lexicons>
<two-character-searches>false</two-character-searches>
<one-character-searches>false</one-character-searches>
<uri-lexicon>true</uri-lexicon>
<collection-lexicon>true</collection-lexicon>
<reindexer-enable>true</reindexer-enable>
<reindexer-throttle>5</reindexer-throttle>
<reindexer-timestamp>0</reindexer-timestamp>
<directory-creation>manual</directory-creation>
<maintain-last-modified>false</maintain-last-modified>
<maintain-directory-last-modified>false</maintain-directory-last-modified>
<inherit-permissions>false</inherit-permissions>
<inherit-collections>false</inherit-collections>
<inherit-quality>false</inherit-quality>
<in-memory-limit>174080</in-memory-limit>
<in-memory-list-size>341</in-memory-list-size>
<in-memory-tree-size>85</in-memory-tree-size>
<in-memory-range-index-size>11</in-memory-range-index-size>
<in-memory-reverse-index-size>11</in-memory-reverse-index-size>
<in-memory-triple-index-size>44</in-memory-triple-index-size>
<large-size-threshold>1024</large-size-threshold>
<locking>fast</locking>
<journaling>fast</journaling>
<journal-size>682</journal-size>
<journal-count>2</journal-count>
<preallocate-journals>false</preallocate-journals>
<preload-mapped-data>false</preload-mapped-data>
<preload-replica-mapped-data>false</preload-replica-mapped-data>
<range-index-optimize>facet-time</range-index-optimize>
<positions-list-max-size>256</positions-list-max-size>
<format-compatibility>automatic</format-compatibility>
<index-detection>automatic</index-detection>
<expunge-locks>none</expunge-locks>
<tf-normalization>scaled-log</tf-normalization>
<merge-priority>lower</merge-priority>
<merge-max-size>32768</merge-max-size>
<merge-min-size>1024</merge-min-size>
<merge-min-ratio>2</merge-min-ratio>
<merge-timestamp>0</merge-timestamp>
<retain-until-backup>false</retain-until-backup>
<assignment-policy-name>bucket</assignment-policy-name>
</package-database-properties>
</config>
</package-database>
Upvotes: 2
Views: 991
Reputation: 2192
Unfiltered searches come with some caveats:
- They determine the results directly from the indexes, without filtering for validation. This makes unfiltered results most comparable to traditional search-engine style results.
- They include false-positive results. False-positive results can originate from a number of situations, including phrase searches containing 3 or more words, certain wildcard searches, punctuation-sensitive, diacritic-sensitive, and/or case-sensitive searches.
MarkLogic provides a way to determine whether a result is a false positive: you can use cts:contains for that. This XQuery shows that your result is indeed a false positive:
xquery version "1.0-ml";
declare boundary-space preserve;
declare namespace qm="http://marklogic.com/xdmp/query-meters";
(: Counters, updated in place with xdmp:set :)
let $trueCounter := 0
let $falseCounter := 0
let $query := cts:element-value-query(xs:QName('data'), 'theater* *')
(: Re-check every unfiltered result against the query with cts:contains :)
let $x :=
  for $result in cts:search(fn:doc(), $query, "unfiltered")
  return
    (
      if ( cts:contains($result, $query) )
      then ( xdmp:set($trueCounter, $trueCounter + 1) )
      else ( xdmp:set($falseCounter, $falseCounter + 1) )
    )
return
  <results>
    <resultTotal>{$trueCounter}</resultTotal>
    <false-positiveTotal>{$falseCounter}</false-positiveTotal>
    <elapsed-time>{xdmp:query-meters()/qm:elapsed-time/text()}</elapsed-time>
  </results>
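MarkLogic searches are split into two steps: index resolution, which selects candidate fragments from the indexes, and filtering, which checks each candidate against the query.
An unfiltered query skips the second step, which is why it can return false positives. You can read more about that here.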
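To see the effect of the two steps on your own data, one option is to compare the index-only count with the filtered count. This is only a sketch: it reuses the query and options from the question, and the wrapper element names (counts, index-resolution, after-filtering) are made up for illustration.

```xquery
xquery version "1.0-ml";

(: Same query and options as in the question. :)
let $query :=
  cts:element-value-query(
    xs:QName("data"),
    "theater* *",
    ("wildcarded", "case-insensitive", "unstemmed",
     "punctuation-sensitive", "whitespace-sensitive"))
return
  (: The element names below are illustrative only. :)
  <counts>
    {
      (: xdmp:estimate counts matches using the indexes only, without filtering :)
      <index-resolution>{xdmp:estimate(cts:search(fn:doc(), $query))}</index-resolution>,
      (: a filtered search checks each candidate before it is counted :)
      <after-filtering>{fn:count(cts:search(fn:doc(), $query, "filtered"))}</after-filtering>
    }
  </counts>
```

If the first number is larger than the second, the difference is the false positives that filtering would have removed.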
Edit: This section further describes applications which can use unfiltered searches:
- Your content and search terms are such that you know the unfiltered searches are also accurate (for example, the searches are all performed at document or fragment roots, they are single-term queries, and are not wildcard, punctuation-sensitive, diacritic-sensitive, and/or capitalization-sensitive searches).
- You do not mind if there are some false-positive results because the results are an estimate (that is, they need to be fast, but are not required to be exact).
- Your searches return a large number of results and you want efficient ways to jump to a particular portion of those results.
As the first item states, you cannot combine wildcard queries with unfiltered searches if you do not want false positives. I would stick to filtered searches in this case.
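As a minimal sketch, the search from the question can be run filtered by replacing the 'unfiltered' option. Filtered is also the default when neither option is given, so dropping the option altogether should behave the same way.

```xquery
xquery version "1.0-ml";

(: Same query as in the question, but filtered: every candidate
   from the indexes is checked against the query before it is returned. :)
cts:search(
  fn:doc(),
  cts:element-value-query(
    xs:QName("data"),
    "theater* *",
    ("wildcarded", "case-insensitive", "unstemmed",
     "punctuation-sensitive", "whitespace-sensitive")),
  "filtered")
```

The trade-off is speed: filtering has to load and check each candidate fragment, so it costs more when the indexes return many candidates.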
Hope this helps!
Upvotes: 2
Reputation: 121
Try creating an element range index on the data element and then run the search below:
let $terms := cts:element-value-match(xs:QName("data"), "theater* *")
return
  cts:search(
    doc(),
    cts:element-value-query(
      xs:QName('data'),
      $terms,
      ('wildcarded', 'case-insensitive', 'unstemmed', 'punctuation-sensitive', 'whitespace-sensitive')
    ),
    'unfiltered'
  )
This will not return the '/sample/2.xml' document.
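To check what the lexicon lookup contributes, you could run it on its own first. This sketch assumes the string element range index on data already exists; without it the call raises an error.

```xquery
xquery version "1.0-ml";

(: Lists the values of <data> in the range index that match the pattern.
   Assumes an element range index of type string on <data>. :)
cts:element-value-match(xs:QName("data"), "theater* *")
```

The pattern is matched against complete element values taken from the range index, so a value such as 'the living Theater' should not be listed, and the follow-up search is then built only from the values that are.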
Upvotes: 1