Konstantin Brodin

Reputation: 142

Elasticsearch long phrase search

I'm using Elasticsearch for full text search and I'm trying to find a better way to search for long phrases.

For example, I have a field "Seller" that can be up to 250 chars and I want to find all items with Seller = 'some seller name with spaces'.

If I understand correctly, in order to search text that contains spaces, I have to use an NGramTokenizer that basically creates tokens like:

's', 'so', 'som', 'some', 'some ', 'some s' etc. 

I know that I can define min and max gram, but I need to be able to search for a phrase like 'a b', so grams of length 3 have to be indexed, and my max gram has to be as long as the field's maximum length.

So I have to create a lot of tokens per item, and that's just the seller; what about a description with 4k chars?

This solution has very poor performance.

Can anyone suggest a better solution to work with long phrases with spaces?

My index settings:

"analysis": {
  "analyzer": {
    "autoComplete": {
      "filter": [
        "lowercase"
      ],
      "type": "custom",
      "tokenizer": "autoComplete"
    },
    "caseInsensitive": {
      "type": "custom",
      "filter": [
        "lowercase"
      ],
      "tokenizer": "keyword"
    }
  },
  "tokenizer": {
    "autoComplete": {
      "type": "nGram",
      "min_gram": "1",
      "max_gram": "40"
    }
  }
}

I use "autoComplete" as an index analyzer and "caseInsensitive" as search analyzer

EDIT:

I use an NGramTokenizer in order to be able to search for parts of words.

Real-world example:

Title: 'Huge 48" Bowtie LED Opti neon wall sign. 100,000 hours Bar lamp light'

search query: 'Huge 48" Bowt'

With a whitespace tokenizer you can't match parts of words when you search for a phrase.

Upvotes: 4

Views: 3653

Answers (1)

slawek

Reputation: 2779

The first question you need to answer is: do you need to match substrings within words, for example matching miss in transmission? If you need this functionality, then there's no better way to achieve it than ngrams. Trying to use a wildcard at the beginning of a term would mean going through every term in the index to see if it matches, and that doesn't scale well.

Note that you can use ngrams in two ways: as a tokenizer or as a token filter. Instead of the tokenizer you used, you could also use the token filter variant: first tokenize the text with the standard or whitespace tokenizer, then apply an ngram token filter. With the token filter you wouldn't have grams with spaces in your index. How often do you need to find text where a word ends with ing and immediately after it there's a word that starts with to?
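For illustration, the token filter variant could be configured roughly like this (the filter and analyzer names are made up for this sketch, and the exact type names vary a bit between Elasticsearch versions):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngram_filter": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3
        }
      },
      "analyzer": {
        "ngram_filter_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_ngram_filter"]
        }
      }
    }
  }
}

Because the text is split into words first, no gram ever spans a space.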

If you don't need to look inside a word, but sometimes want to leave out the suffix, there are a couple of other options. The first one is the other kind of grams, edge grams, which are anchored at the beginning of the word. The most common use case for edge ngrams is search-as-you-type functionality.
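The edge variant is configured the same way; a hypothetical sketch (newer versions spell the type edge_ngram, older ones edgeNGram; the names are again placeholders):

{
  "settings": {
    "analysis": {
      "filter": {
        "my_edge_filter": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 15
        }
      },
      "analyzer": {
        "edge_filter_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_edge_filter"]
        }
      }
    }
  }
}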

Below you can see an example comparison of indexing huge bowtie (screenshot from the inquisitor plugin) using all those gram approaches (min: 2, max: 3):

[Screenshot: token lists with their positions for edge_filter, edge_tokenizer, ngram_filter and ngram_tokenizer]

The numbers next to the tokens are important: they are position numbers. Position numbers are used when looking up phrases. Looking for the phrase "a b" is essentially looking for the token "a", then looking for the token "b", and checking that their position difference equals 1. As you can see above, the positions those grams produce may cause some problems when looking up phrases.
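If you want to inspect the tokens and positions yourself, the _analyze API prints them; a sketch with placeholder index and analyzer names (older versions also accept the analyzer and text as URL parameters):

GET /my_index/_analyze
{
  "analyzer": "edge_filter_analyzer",
  "text": "huge bowtie"
}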

First, let's see how phrase queries would be interpreted for a field analyzed in this way, using the query "huge bowtie" and the _validate API (a sketch of such a call follows the list):

  • edge_filter "(hu hug huge) (bo bow bowt bowti bowtie)"
  • edge_tokenizer "hu hug huge bo bow bowt bowti bowtie"
  • ngram_filter "(hu hug ug uge ge) (bo bow ow owt wt wti ti tie ie)"
  • ngram_tokenizer "hu hug ug uge ge bo bow ow owt wt wti ti tie ie"
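Such a check can be run roughly like this (index and field names are placeholders):

GET /my_index/_validate/query?explain
{
  "query": {
    "match_phrase": {
      "title": "huge bowtie"
    }
  }
}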

The tokenizer query interpretations are rather straightforward: instead of looking for two tokens one after another, you have to look for all the grams and make sure they follow each other. The filter versions are more troublesome: the query "huge bowtie" would match the text hu owt, because it's enough that at least one gram within each word matches.

You must also be careful if you use analyzed queries and don't specify that you need a phrase search. For example, using "query_string": { "query": "bowtie" } will translate to bo OR bow OR bowt OR bowti OR bowtie for edge ngrams, because the default query_string operator is OR. That's rather not what the user wanted, because it will match anything containing bo.
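If you do want phrase semantics from query_string, quote the phrase inside the query (or change default_operator); a sketch with a placeholder field name:

{
  "query": {
    "query_string": {
      "default_field": "title",
      "query": "\"huge bowtie\""
    }
  }
}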

Notice also that if there is more than one token at the same position, some phrases will match even though they shouldn't. For example, the phrase "hu bowti" would match with the edge_filter and ngram_filter tokens even though there is no such phrase in the source text.

It may seem that the token filter variants of grams are inferior and not really useful. But when using gram token filters, people commonly use a different analyzer for searching than for indexing. For instance, if we leave the query "huge bowtie" as is, without gram-analyzing it, it would find a match by looking up only 2 terms (because they are both in the index: there's huge:1 and bowtie:2). Using this approach, though, you need to set n rather high (to be 100% sure everything will match, it should be equal to the longest word). Otherwise you could have a situation where, with max gram 5, a search for bowtie wouldn't match because the index would contain only the bowti token.
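In mapping terms that means giving the field a gram analyzer for indexing and a plain one for searching; a hypothetical sketch (recent versions use type text with analyzer/search_analyzer, 1.x used string with index_analyzer, so the exact structure depends on your version):

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "edge_filter_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}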

As you can see, grams introduce quite complex problems. That's why people usually combine grams with normally indexed text (using a multi_field mapping), leaving themselves options for later. Indexing the same text with different analyzers lets you search in multiple ways and increase precision when using both fields in a search at once.
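A multi-field sketch along those lines (field and analyzer names are made up; older versions used the multi_field type, newer ones the fields parameter shown here):

{
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "fields": {
          "edge": {
            "type": "text",
            "analyzer": "edge_filter_analyzer",
            "search_analyzer": "standard"
          }
        }
      }
    }
  }
}

With that, the grammed variant is queried as title.edge while title keeps the original tokens.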

If you don't want to deal with all those gram-related problems, you could simply index the text normally and use wildcards. You pay the price at search time, but depending on your data and scenarios it could work. Personally, at my company we use wildcards to query indexes which together hold a couple of billion documents, and Elasticsearch handles it just fine.
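For example, a suffix wildcard over a normally indexed field might look like this (the field name is a placeholder; leading wildcards are the expensive case mentioned earlier):

{
  "query": {
    "wildcard": {
      "title": {
        "value": "bowt*"
      }
    }
  }
}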

If you decide to use wildcard queries, you have a couple of options. You can use the wildcard query or the query_string query, but with them you won't be able to make a phrase query and a wildcard suffix query at once. Fortunately there's a match query variant which does exactly what you want: it searches for a phrase with the last word treated as incomplete:

{
    "match_phrase_prefix" : {
        "message" : {
            "query" : "Huge 48\" Bowt",
            "max_expansions" : 100
        }
    }
}

Excerpt from docs:

The match_phrase_prefix is the same as match_phrase, except that it allows for prefix matches on the last term in the text.

To sum it up.

If I understand your case correctly, I would use the edge tokenizer or my favourite, the edge token filter (with a standard search analyzer), in a multi-field alongside the original text. Having the original text allows you to use lower values in the edge grams. With such a mapping you could use a query_string like: "originalText: \"Huge 48\" Bowt\" OR edgeGrammed: \"Huge 48\" Bowt\"". You wouldn't have to worry that your n in the edge gram is too low, because you have a fallback in the original text. I think n equal to 10-15 should be enough. Also, with the original text, wildcards are always an option.
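As a request body, that combined query could look roughly like this (originalText and edgeGrammed are just the hypothetical field names from above; the inch quote from 48" has to be escaped once for the query_string syntax and again for JSON, or simply dropped, since the analyzer strips it anyway):

{
  "query": {
    "query_string": {
      "query": "originalText:\"Huge 48\\\" Bowt\" OR edgeGrammed:\"Huge 48\\\" Bowt\""
    }
  }
}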

Here is a nice article about ngrams as well.

Upvotes: 7
