Reputation: 123
Let's say that you have a thesaurus entry for the United States of America that includes United States, USA, and America. Not the best example, but you get the idea. A user searches for United States government. How do you parse this string to pass into the thsr:expand function?
Passing the whole string "United States government" would not work, nor is it what I want. I want the thesaurus entries for "United States" so that documents with USA government and United States of America government are returned. Thanks in advance.
Upvotes: 0
Views: 427
Reputation: 11771
Unless a change has been made recently, thsr:expand does not work with multiple word thesaurus terms. However, it is possible to roll your own multi-word thesaurus expansion.
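For comparison, this is roughly what the single-word case looks like, which is the case thsr:expand handles. Treat it as a sketch: the thesaurus URI and the search term are placeholders I picked for illustration, so substitute your own.

```xquery
xquery version "1.0-ml";
import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
  at "/MarkLogic/thesaurus.xqy";

(: Assumed thesaurus document URI and term; both are placeholders. :)
let $thsr-uri := "/config/jmp-thesaurus.xml"
let $term := "America"
return
  thsr:expand(
    cts:word-query($term),          (: the query to expand :)
    thsr:lookup($thsr-uri, $term),  (: thesaurus entries for the term :)
    (),  (: new-weight: use the default :)
    (),  (: min-weight: no minimum :)
    ()   (: filter: none :)
  )
```

The lookup is keyed on a single term string, which is why a query string like United States government has to be split up some other way before any expansion can happen.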
There are several steps to the solution, and I actually gave this - exactly - as an example in a presentation at MarkLogic World titled Search Intelligence and MarkLogic API. The multi-word thesaurus example begins at slide 32, if you want to skip ahead.
I have also put all the code from the presentation up on GitHub.
The gist is: first call search:parse and convert the resulting cts:query XML into an intermediate-type XML that contains "runs" (if you are familiar with WordML). Then the runs are expanded using cts:highlight and an OR-query of thesaurus terms. Finally, the remaining runs are resolved back into cts:query XML, and searched using search:resolve.
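The outer shape of that pipeline can be sketched as follows. This is only a skeleton under assumptions: local:expand is a hypothetical stand-in for the run creation, expansion, and resolution steps (here it returns the query unchanged), and default search options are used.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
  at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical placeholder for the expansion step.
   A real version would rewrite the cts:query XML; this one is a no-op. :)
declare function local:expand($q as element()) as element()
{
  $q
};

let $qtext := '"United States" government'
let $parsed := search:parse($qtext)      (: query text -> cts:query XML :)
let $expanded := local:expand($parsed)   (: thesaurus expansion goes here :)
return search:resolve($expanded)         (: run the rewritten query :)
```

Working on the parsed XML rather than the raw string means the expansion only has to deal with word queries, not with the search grammar itself.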
It's pretty fast, but if your thesaurus is truly enormous, the speed could probably be improved with some optimizations.
Update
I just noticed that you may only be trying to expand a quoted phrase into multiple quoted phrase synonyms, while my example expands unquoted phrases into an OR-query of AND-ed word queries (unquoted phrases).
You could actually skip the run creation/resolution steps, and rework exprun:thsr-expand-runs into something that works directly on phrases:
declare function exprun:thsr-expand-phrases(
  $q as item(), (: cts:query XML :)
  $q-thsr as item() (: thesaurus terms :)
) as item()
{
  typeswitch($q)
  case element(cts:word-query) return
    if (not($q[@qtextpre and @qtextpost])) then $q (: not a quoted phrase :)
    else (: this is a phrase :)
      element cts:word-query {
        $q/namespace::*, $q/@*, $q/node(),
        let $expanded-text :=
          cts:highlight($q/cts:text, $q-thsr,
            if (count($cts:queries) gt 1)
            then xdmp:set($cts:action, "continue") (: ignore matches within matches :)
            else thsr:lookup("/config/jmp-thesaurus.xml",
              cts:word-query-text($cts:queries[1]))//thsr:synonym/thsr:term/string()
          )
        where ($expanded-text ne $q/cts:text) (: found matches :)
        return ($expanded-text,
          element cts:option { 'synonym' })
      }
  case text() return $q
  default return
    element {node-name($q)} {
      $q/namespace::*,
      $q/@*,
      (: recurse one child at a time, since $q is typed as a single item() :)
      for $child in $q/node()
      return exprun:thsr-expand-phrases($child, $q-thsr)
    }
};
You will still need to supply this function a cts:or-query of thesaurus terms:
cts:or-query(doc('thesaurus.xml')//thsr:entry/thsr:term/cts:word-query(string(.)))
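If you want to check that an OR-query like this picks out the phrase you expect, something along these lines can be run on its own. It is a minimal sketch: the literal term list stands in for the thesaurus document, and the match wrapper element is just an illustrative name.

```xquery
xquery version "1.0-ml";

(: Hypothetical literal terms standing in for the thesaurus document. :)
let $terms := ("United States of America", "United States", "USA", "America")
let $q-thsr := cts:or-query(for $t in $terms return cts:word-query($t))

(: An in-memory node shaped like the text of a parsed word query. :)
let $text :=
  <cts:text xmlns:cts="http://marklogic.com/cts">United States government</cts:text>

(: Wrap each span of text that matches a thesaurus term. :)
return cts:highlight($text, $q-thsr, <match>{$cts:text}</match>)
```

The expansion function above relies on the same mechanism, except that it replaces each matched span with the synonyms returned by thsr:lookup instead of wrapping it.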
This will only operate on quoted phrases, though. So if you want to operate on unquoted phrases, you would still need to create runs. If you want to operate on both, you'll need to make minor changes to the GitHub example code (it skips quoted phrases).
Upvotes: 1