Pedery

Reputation: 3636

Best way to handle title search in XML using XQuery when the node contains special characters

I was recently given the task of creating a search field for our MarkLogic database. The part of our XML that needs to be searched can look like this:

<title_group>
    <title xml:lang="fr" source="sdo">Amendement 2 - Dispositifs à semiconducteurs - Partie 16-1: Circuits intégrés hyperfréquences - Amplificateurs</title>
    <title xml:lang="en" source="sdo">Amendment 2 - Semiconductor devices - Part 16-1: Microwave integrated circuits - Amplifiers</title>
    <title xml:lang="no">Tillegg 2 - Halvlederenheter - Del 16-1: Mikrobøgekretser - Forsterkere</title>
  </title_group>

There is currently no element range index configured for these nodes in the Admin interface.

Now, in this particular case, I believe the hyphens are causing problems. I've tried:

  let $searchTerm := fn:replace($title, "\s+-\s+", "* *")
  let $searchTerm := fn:replace($searchTerm, "-", "* *")

but to little avail.

The current search is done as follows:

  let $product_query:= cts:element-word-query(xs:QName("product:title"), fn:concat("*",$searchTerm,"*"), ("case-insensitive", "punctuation-insensitive"))
  let $products := cts:search(/product:product, $product_query, ("filtered", $index_order))[1 to $result_limit]

This enables me to get a proper result when I search for "Tillegg 2" or "Tillegg 2 - Halvlederenheter", but it fails when I include anything more of the title. Do I need to preprocess the string into an and-query, or is there a smarter way?

Upvotes: 1

Views: 159

Answers (2)

asusu

Reputation: 321

I'm not sure why something simpler doesn't work. With that XML doc in my database I can get it back with:

let $searchTerm := 'Tillegg 2 - Halvlederenheter - Del 16-1: Mikrobøgekretser'
let $product_query
    := cts:element-word-query(xs:QName("title"), $searchTerm, ('lang=no'))
return cts:search(/, $product_query)

Is that what you wanted?

I had to change/simplify a lot from what you posted. Also, lang=no might be treated as a generic language in MarkLogic 8, though that doesn't exactly come into play here. If you want the words to be allowed in any order (like your solution), then this seems to work:

(: cts:tokenize takes a bare language code as its second argument, not a "lang=" option :)
let $searchTerm := 'Mikrobøgekretser Tillegg Halvlederenheter 2 - Halvlederenheter - Del 16'
(: keep only word tokens (drop punctuation and whitespace), then de-duplicate :)
let $words := fn:distinct-values (cts:tokenize ($searchTerm, 'no')
    ! (if (. instance of cts:word) then . else ()))
let $product_query := cts:element-word-query(xs:QName("title"), $words,
    ('lang=no'))
return ($words, cts:search(/, $product_query))

Edit: sorry, that last one is an OR, not an AND, because passing several words to a single cts:element-word-query matches any one of them. For an AND, you could get the words the same way and then construct the and-query as you did.
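A minimal sketch of that AND variant, under a few assumptions: the unprefixed title element from the sample XML, a bare language code ("no") as the second argument to cts:tokenize, and fn:doc() as the search scope. The search string is only an illustration.

```xquery
xquery version "1.0-ml";

(: Illustrative search string; word order differs from the stored title :)
let $searchTerm := "Mikrobøgekretser Tillegg Halvlederenheter Del 16"

(: Keep only word tokens and de-duplicate them :)
let $words := fn:distinct-values(
  cts:tokenize($searchTerm, "no")[. instance of cts:word]
)

(: One word query per token; the and-query requires all of them to match :)
let $query := cts:and-query(
  for $w in $words
  return cts:element-word-query(xs:QName("title"), $w, "lang=no")
)
return cts:search(fn:doc(), $query)
```

Note that each word only has to occur in some title element of the document, so words from the French and Norwegian titles of the same document could jointly satisfy the query.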

Upvotes: 0

Pedery

Reputation: 3636

If anyone else happens to look for an answer to the same thing, this is how I solved it:

  1. Use fn:normalize-space on the search string to trim it and collapse runs of whitespace.
  2. Use fn:tokenize($searchString, '\s+') to get a list of search tokens.
  3. Remove single-letter tokens.
  4. Build a cts:and-query containing one cts:element-word-query per remaining token, each with the search options "case-insensitive", "punctuation-insensitive", "diacritic-insensitive", "whitespace-insensitive", "unstemmed", "unwildcarded".
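The steps above could be sketched roughly as follows. This is a reconstruction, not the exact code used: the variable names, the unprefixed title QName (the original search used product:title), the fn:doc() search scope, and the reading of "single-letter" as "single-character" are all assumptions.

```xquery
xquery version "1.0-ml";

(: Hypothetical input; in practice this comes from the search field :)
let $searchString := "Tillegg 2 - Halvlederenheter - Del 16-1: Mikrobøgekretser"

let $options := ("case-insensitive", "punctuation-insensitive",
                 "diacritic-insensitive", "whitespace-insensitive",
                 "unstemmed", "unwildcarded")

(: Steps 1 and 2: collapse whitespace, then split on it :)
let $tokens := fn:tokenize(fn:normalize-space($searchString), "\s+")

(: Step 3: drop one-character tokens such as "-" (assumption: this also drops "2") :)
let $tokens := $tokens[fn:string-length(.) gt 1]

(: Step 4: every remaining token must match somewhere in a title element :)
let $query := cts:and-query(
  for $t in $tokens
  return cts:element-word-query(xs:QName("title"), $t, $options)
)
return cts:search(fn:doc(), $query)
```

Because the lone hyphens are discarded before the query is built, they can no longer break the match, and the and-query makes the result independent of word order.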

Upvotes: 2
