Reputation: 3057
I have the following analyzer (a slight tweak to the way snowball would be set up):
string_analyzer: {
  filter: [ "standard", "stop", "snowball" ],
  tokenizer: "lowercase"
}
Here is the field it is applied to, along with the query I'm running:
indexes :title, type: 'string', analyzer: 'string_analyzer'
query do
  match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
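For context, that query block should translate into roughly the following match query body (a sketch only; I'm showing search_terms as "foo bar" for illustration and assuming the options pass through unchanged):
{
  "query" : {
    "match" : {
      "title" : {
        "query" : "foo bar",
        "fuzziness" : 0.5,
        "max_expansions" : 10,
        "operator" : "and"
      }
    }
  }
}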
I have a record in my index with the title foo bar. If I search for foo bar, it appears in the results. However, if I search for foobar, it doesn't.
Can someone explain why, and if possible how I could get it to work? Can someone also explain how to get the reverse working, so that a record with the title foobar would show up when a user searches for foo bar?
Thanks
Upvotes: 4
Views: 826
Reputation: 17319
You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which splits a string on non-letter characters and lowercases the resulting tokens), then applying the standard filter (redundant, because you are not using the standard tokenizer), followed by the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "string_analyzer" : {
          "filter" : [
            "standard",
            "stop",
            "snowball"
          ],
          "tokenizer" : "lowercase"
        }
      }
    }
  }
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the single term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
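For completeness, the analyze API returns one entry per token; the response for "foo bar" looks roughly like this (offsets and position numbering may vary slightly between Elasticsearch versions):
{
  "tokens" : [
    { "token" : "foo", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 },
    { "token" : "bar", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 2 }
  ]
}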
If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we define a custom analyzer built around an ngram token filter.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "ngrams" : {
          "type" : "ngram",
          "min_gram" : 1,
          "max_gram" : 5
        }
      },
      "analyzer" : {
        "ngrams" : {
          "filter" : [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, the query looks for any of those tokens, some of which exist in the index. Unfortunately, it will also overlap with "wear the fox hat" (which shares tokens such as f, o and a). While the foo bar document will appear higher up the list of results because it has more tokens in common with the query, you will still get apparently unrelated results.
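To see that overlap for yourself, you could index both titles and run a plain match query against them (a sketch; it assumes my_field has been mapped to use the ngrams analyzer, which I've left out here). Both documents should come back, with "foo bar" ranked first:
# index the two example documents
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '{ "my_field" : "foo bar" }'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '{ "my_field" : "wear the fox hat" }'

# search for "foobar" with a plain match query
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : "foobar"
    }
  }
}
'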
This can be controlled by using the minimum_should_match parameter, e.g.:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : {
        "query" : "foobar",
        "minimum_should_match" : "60%"
      }
    }
  }
}
'
The exact value for minimum_should_match depends upon your data, so experiment with it.
Upvotes: 4