Reputation: 270
I am trying to write a query in Elasticsearch that matches contiguous characters within words. So, if my index has "John Doe", I should still see "John Doe" returned by Elasticsearch for the searches below.
I have tried the below query so far.
{
"query": {
"multi_match": {
"query": "term",
"operator": "OR",
"type": "phrase_prefix",
"max_expansions": 50,
"fields": [
"Field1",
"Field2"
]
}
}
}
But this also returns unnecessary matches, e.g. I still get "John Doe" when I type "john x".
Upvotes: 6
Views: 10090
Reputation: 279
Here is an updated fix.
Create the index with:
body = {
"settings": {
"analysis": {
"analyzer": {
"autocomplete": {
"tokenizer": "autocomplete",
"filter": [
"lowercase"
]
},
"autocomplete_search": {
"tokenizer": "lowercase"
}
},
"tokenizer": {
"autocomplete": {
"type": "edge_ngram",
"min_gram": 2,
"max_gram": 10,
"token_chars": [
"letter"
]
}
}
}
},
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "autocomplete",
"search_analyzer": "autocomplete_search"
}
}
}
}
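As a quick sketch of how this mapping behaves (assuming the settings and mappings above were used to create an index called my_index; the index name is not part of the answer), you can index a document and search it with a plain match query. Note that an edge_ngram tokenizer only indexes word prefixes, so mid-word fragments like "ohn" will not match:
PUT my_index/_doc/1
{
  "title": "John Doe"
}
POST my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "John Do",
        "operator": "and"
      }
    }
  }
}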
Upvotes: 0
Reputation: 217254
As explained in my comment above, prefix wildcards should be avoided at all costs as your index grows, since they force ES to do full index scans. I'm still convinced that ngrams (more precisely, edge-ngrams) are the way to go, so I'm taking a stab at it below.
The idea is to index every suffix of the input and then use a prefix query to match any of them, since searching for prefixes doesn't suffer the same performance issues as searching for suffixes. Concretely, john doe would be indexed as follows:
john doe
ohn doe
hn doe
n doe
doe
oe
e
That way, a prefix query can match any sub-part of those tokens, which effectively achieves the goal of matching partial contiguous words while ensuring good performance.
The definition of the index would go like this:
PUT my_index
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "keyword",
"filter": [
"lowercase",
"reverse",
"suffixes",
"reverse"
]
}
},
"filter": {
"suffixes": {
"type": "edgeNGram",
"min_gram": 1,
"max_gram": 20
}
}
}
}
},
"mappings": {
"doc": {
"properties": {
"name": {
"type": "text",
"analyzer": "my_analyzer",
"search_analyzer": "standard"
}
}
}
}
}
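If you want to verify what the analyzer produces, running it through the _analyze API should return the suffix tokens listed above:
POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john doe"
}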
Then we can index a sample document:
PUT my_index/doc/1
{
"name": "john doe"
}
And finally all of the following searches will return the john doe
document:
POST my_index/_search
{
"query": {
"prefix": {
"name": "john doe"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "john do"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "ohn do"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "john"
}
}
}
POST my_index/_search
{
"query": {
"prefix": {
"name": "n doe"
}
}
}
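Conversely, a query that is not a contiguous part of the name, such as the john x example from the question, starts no indexed suffix and should return no hits:
POST my_index/_search
{
  "query": {
    "prefix": {
      "name": "john x"
    }
  }
}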
Upvotes: 14
Reputation: 270
This is what worked for me. Instead of an ngram, index your data as keyword and use a wildcard query to match the words (term below is the user's search string).
"query": {
"bool": {
"should": [
{
"wildcard": { "Field1": "*" + term + "*" }
},
{
"wildcard": { "Field2": "*" + term + "*" }
}
],
"minimum_should_match": 1
}
}
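For example, if term is ohn, the actual request body sent to Elasticsearch would look roughly like this (index name my_index assumed; note that wildcards on a keyword field are case-sensitive unless you add a lowercase normalizer):
POST my_index/_search
{
  "query": {
    "bool": {
      "should": [
        { "wildcard": { "Field1": "*ohn*" } },
        { "wildcard": { "Field2": "*ohn*" } }
      ],
      "minimum_should_match": 1
    }
  }
}
Bear in mind that the leading wildcard carries the performance caveat mentioned in the answer above.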
Upvotes: 2