ybensimhon

Reputation: 113

Elasticsearch Ngram and Query String Query

I am using Elasticsearch 1.2.1.

I am using Ngram tokenizer to tokenize my docs. I have a special use case, where my field may be very long (200-500 chars) and I would like to support lengthy (up to 200 chars) "contains" queries from any point of the field.

I started with an Ngram analyzer with up to 260 chars and quickly discovered that indexing was too slow and the index was too large, so I reduced the maximum gram size to about 30 chars.

Now, I would like to be able to break tokens larger than 30 chars into smaller tokens and replace the user search with the broken tokens (knowing that I might be getting more results than I might have if I were to use a larger Ngram index).

What is the recommended way of achieving this functionality? Note that I am using query string query.

Upvotes: 1

Views: 1568

Answers (1)

uı6ʎɹnɯ ꞁəıuɐp

Reputation: 3481

Try the solution which is described here: Exact Substring Searches in ElasticSearch

{
    "mappings": {
        "my_type": {
            "index_analyzer":"index_ngram",
            "search_analyzer":"search_ngram"
        }
    },
    "settings": {
        "analysis": {
            "filter": {
                "ngram_filter": {
                    "type": "ngram",
                    "min_gram": 3,
                    "max_gram": 8
                }
            },
            "analyzer": {
                "index_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": [ "ngram_filter", "lowercase" ]
                },
                "search_ngram": {
                    "type": "custom",
                    "tokenizer": "keyword",
                    "filter": "lowercase"
                }
            }
        }
    }
}

To solve both the disk usage problem and the too-long search term problem, short ngrams of at most 8 characters are used (configured with "max_gram": 8). To search for terms longer than 8 characters, turn your search into a boolean AND query looking for every distinct 8-character substring of that string. For example, if a user searched for large yard (a 10-character string), the search would be:

"large ya" AND "arge yar" AND "rge yard"
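
The splitting step could be done on the client before the query is sent. Below is a minimal sketch in Python, assuming max_gram is 8 as in the settings above; the helper name build_ngram_query is hypothetical and not part of any Elasticsearch client. Each substring is wrapped in double quotes because query string query would otherwise split on the whitespace inside it.

```python
def build_ngram_query(term, max_gram=8):
    """Build a query_string expression that ANDs together every distinct
    max_gram-character substring of `term`.

    Hypothetical helper for illustration; max_gram must match the
    "max_gram" value used by the index-time ngram filter.
    """
    if len(term) <= max_gram:
        # Short enough to match an indexed ngram directly.
        grams = [term]
    else:
        grams = []
        seen = set()
        # Slide a max_gram-wide window over the term, one character at a time.
        for start in range(len(term) - max_gram + 1):
            gram = term[start:start + max_gram]
            if gram not in seen:  # keep only distinct substrings, in order
                seen.add(gram)
                grams.append(gram)

    def quote(gram):
        # Escape backslashes and double quotes, then wrap as a phrase so
        # that embedded whitespace stays inside a single clause.
        escaped = gram.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'

    return " AND ".join(quote(g) for g in grams)


if __name__ == "__main__":
    print(build_ngram_query("large yard"))
```

The returned string would go into the "query" field of the query string query. Note that a search term shorter than min_gram (3 in the settings above) has no indexed ngram to match, and, as the question already anticipates, ANDing the substrings can return documents that contain every substring without containing the full string.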

Upvotes: 2
