Joe Glorioso

Reputation: 123

How do you use the MarkLogic thesaurus API on phrases?

Let's say that you have a thesaurus entry for the United States of America that includes United States, USA, and America. Not the best example, but you get the idea. A user searches for United States government. How do you parse this string to pass it into the thsr:expand function?
Passing "United States government" as a whole would not work, nor is it what I want. I want the thesaurus entries for "United States" so that documents with USA government and United States of America government are returned. Thanks in advance.

Upvotes: 0

Views: 427

Answers (1)

wst

Reputation: 11771

Unless a change has been made recently, thsr:expand does not work with multiple word thesaurus terms. However, it is possible to roll your own multi-word thesaurus expansion.
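For comparison, this is roughly what the stock single-word expansion looks like. It is only a sketch: the thesaurus URI is the one used in the code further down, so treat it as an assumption about where your thesaurus document lives.

```xquery
xquery version "1.0-ml";
import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
    at "/MarkLogic/thesaurus.xqy";

(: Expand a single-word query with the synonyms of that word.
   The thesaurus URI is an assumption; use your own. :)
thsr:expand(
    cts:word-query("America"),
    thsr:lookup("/config/jmp-thesaurus.xml", "America"),
    (),  (: new-weight :)
    (),  (: min-weight :)
    ()   (: filter :)
)
```

The result is a cts:query that matches the original word or any of its synonyms, which is the behavior the multi-word approach below tries to reproduce for phrases.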

There are several steps to the solution, and I actually gave this - exactly - as an example in a presentation at MarkLogic World titled Search Intelligence and MarkLogic API. The multi-word thesaurus example begins at slide 32, if you want to skip ahead.

I have also put all the code from the presentation up on GitHub.

The gist is: First search:parse and convert the cts:query XML into an intermediate-type XML that contains "runs" (if you are familiar with WordML). Then the runs are expanded using cts:highlight and an OR-query of thesaurus terms. Finally, the remaining runs are resolved back into cts:query XML, and searched using search:resolve.
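The outer shape of that pipeline can be sketched as below. This is not the code from the presentation: local:expand is a hypothetical placeholder that returns the query unchanged, standing in for the run creation, expansion, and resolution steps.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";

(: Hypothetical placeholder: the real version would build runs, expand
   them against the thesaurus, and turn them back into cts:query XML. :)
declare function local:expand($q as element()?) as element()?
{
    $q
};

(: 1. Parse the user's query text into cts:query XML. :)
let $parsed := search:parse('United States government')
(: 2. Rewrite the query XML. :)
let $expanded := local:expand($parsed)
(: 3. Run the rewritten query. :)
return search:resolve($expanded)
```

Returning $parsed instead of calling search:resolve is a quick way to inspect the XML that the expansion step has to work with.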

It's pretty fast, but if your thesaurus is truly enormous, the speed could probably be improved with some optimizations.

Update

I just noticed that you may only be trying to expand a quoted phrase into multiple quoted phrase synonyms, while my example expands unquoted phrases into an OR-query of AND-ed word queries (unquoted phrases).

You could actually skip the run creation/resolution steps, and rework exprun:thsr-expand-runs into something that works directly on phrases:

declare function exprun:thsr-expand-phrases(
    $q as item(),     (: cts:query XML :)
    $q-thsr as item() (: cts:or-query of thesaurus terms :)
) as item()
{
    typeswitch($q)
        case element(cts:word-query) return
            if (not($q[@qtextpre and @qtextpost])) then $q
            else (: this is a phrase :)
                element cts:word-query {
                    $q/namespace::*, $q/@*, $q/node(),
                    let $expanded-text :=
                        cts:highlight($q/cts:text, $q-thsr,
                            if (count($cts:queries) gt 1)
                            then xdmp:set($cts:action, "continue") (: ignore matches within matches :)
                            else
                                (: replace the matched term with its synonyms :)
                                thsr:lookup("/config/jmp-thesaurus.xml",
                                    cts:word-query-text($cts:queries[1])
                                )//thsr:synonym/thsr:term/string()
                        )
                    where ($expanded-text ne $q/cts:text) (: found matches :)
                    return ($expanded-text,
                        element cts:option { 'synonym' })
                }
        case text() return $q
        default return
            element {node-name($q)} {
                $q/namespace::*,
                $q/@*,
                (: recurse one child at a time, since $q is typed as a single item :)
                for $child in $q/node()
                return exprun:thsr-expand-phrases($child, $q-thsr)
            }
};

You will still need to supply this function a cts:or-query of thesaurus terms:

cts:or-query(doc('thesaurus.xml')//thsr:entry/thsr:term/cts:word-query(string(.)))

This will only operate on quoted phrases, though. So if you want to operate on unquoted phrases, you would still need to create runs. If you want to operate on both, you'll need to make minor changes to the GitHub example code (it skips quoted phrases).
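Putting the quoted-phrase pieces together might look something like this. It is an untested sketch: the exprun namespace URI and module path are made up for illustration, and it assumes exprun:thsr-expand-phrases has been added to that module.

```xquery
xquery version "1.0-ml";
import module namespace search = "http://marklogic.com/appservices/search"
    at "/MarkLogic/appservices/search/search.xqy";
import module namespace thsr = "http://marklogic.com/xdmp/thesaurus"
    at "/MarkLogic/thesaurus.xqy";
(: Hypothetical namespace URI and path; point these at your own module. :)
import module namespace exprun = "http://example.com/exprun"
    at "/lib/exprun.xqy";

(: One word-query per thesaurus term, OR-ed together. :)
let $q-thsr := cts:or-query(
    doc('thesaurus.xml')//thsr:entry/thsr:term/cts:word-query(string(.)))
(: The phrase has to be quoted for the function to touch it. :)
let $parsed := search:parse('"United States" government')
return search:resolve(exprun:thsr-expand-phrases($parsed, $q-thsr))
```

Building $q-thsr once and reusing it across requests would avoid re-reading the thesaurus document on every search.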

Upvotes: 1
