Reputation: 13
I have a question regarding the behavior of the wildcarded search in MarkLogic.
Basically, I am trying to replicate the SQL LIKE '%something%' query.
Here is the code that returns false positives:
xquery version "1.0-ml";
cts:search(/,
  cts:element-query(fn:QName("", "Document"),
    cts:element-word-query(fn:QName("", "Information"), "*date*", ("wildcarded"), 0),
    ()),
  'unfiltered')
A few notes:
I am using the Unicode collation and have enabled:
What I don't understand is why "*something" and "something*" return correct values, but "*something*" returns false positives? How can I fix this?
Input example:
<Document><Information>another updated document</Information></Document>
<Document><Information>INCUMBENCY CERTIFICATE</Information></Document>
<Document><Information>Certificate of Incumbency</Information></Document>
<Document><Information>something 344_dated 243</Information></Document>
<Document><Information>another terminated document</Information></Document>
Output:
All five documents match, although only documents 1 and 4 should be returned.
Final edit: One thing I would like to add is that the same settings did not produce the same results on two databases, one of which holds far more documents than the other. On the database with many documents, the final settings that I used, and which give correct results, are:
Upvotes: 1
Views: 316
Reputation: 11771
Unfiltered wildcard queries within specific elements (i.e. not just within a document) may return false positives without positional indexes. I would try enabling either or both of word positions and element word positions. It may also be worth testing whether you see additional performance improvements from enabling fast element phrase searches.
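These settings can be changed in the Admin UI, or scripted through the Admin API. The following is a minimal sketch of the scripted route, not a definitive setup: the database name "Documents" is a placeholder, and saving the configuration will trigger reindexing if the reindexer is enabled for that database.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Enable the two positional indexes suggested above. :)
let $config := admin:database-set-word-positions($config, $db, fn:true())
let $config := admin:database-set-element-word-positions($config, $db, fn:true())
return admin:save-configuration($config)
```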
It's possible that "*something and something*" returns fewer false positives simply because it contains more terms, and not because the indexes are resolving that phrase more accurately.
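One way to check whether the indexes alone resolve a query accurately is to run it both unfiltered and filtered and compare the result counts. The sketch below reuses the query from the question. It is a diagnostic only, since the filtered run opens every candidate document and can be slow on a large database.

```xquery
xquery version "1.0-ml";

let $query :=
  cts:element-query(fn:QName("", "Document"),
    cts:element-word-query(fn:QName("", "Information"), "*date*", ("wildcarded")))
return (
  (: Resolved from the indexes only; may include false positives. :)
  fn:count(cts:search(/, $query, "unfiltered")),
  (: Each candidate is verified against the actual document content. :)
  fn:count(cts:search(/, $query, "filtered"))
)
```

If the two counts differ, the current index settings are not sufficient to resolve the query without filtering.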
Update: After reviewing your updated test case, it appears that trailing wildcard index accuracy is not good enough without trailing wildcard word positions enabled. That and three character word positions appear to be necessary to resolve this type of leading-and-trailing element wildcard query.
I would recommend disabling one character searches and two character searches if they are not strictly necessary, since they will generate large indexes. The fast element character searches and fast element trailing wildcard searches settings also do not appear to be required for accuracy in your case, so you might want to test whether your queries are fast enough without them.
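As a rough sketch, the recommendations in this update could be applied through the Admin API as follows. It assumes the wildcard search indexes from the question (such as three character searches and trailing wildcard searches) are already enabled, and "Documents" is again a placeholder database name.

```xquery
xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
  at "/MarkLogic/admin.xqy";

(: "Documents" is a placeholder database name; substitute your own. :)
let $db := xdmp:database("Documents")
let $config := admin:get-configuration()
(: Positional indexes that appeared necessary for "*term*" accuracy. :)
let $config := admin:database-set-trailing-wildcard-word-positions($config, $db, fn:true())
let $config := admin:database-set-three-character-word-positions($config, $db, fn:true())
(: Large indexes that can be dropped if not strictly needed. :)
let $config := admin:database-set-one-character-searches($config, $db, fn:false())
let $config := admin:database-set-two-character-searches($config, $db, fn:false())
return admin:save-configuration($config)
```

After reindexing completes, it is worth re-running the unfiltered query against the sample documents to confirm that only the expected ones come back.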
Upvotes: 4
Reputation: 348
When using cts:element-value-query, did you try the "exact" option to get exact results? Try it once and let me know how it behaves. I faced a similar issue once.
Upvotes: 0