Reputation: 3057
I have the following analyzer (a slight tweak to the way snowball would be set up):
string_analyzer: {
  filter: [ "standard", "stop", "snowball" ],
  tokenizer: "lowercase"
}
Here is the field it is applied to, along with the query I'm running:
indexes :title, type: 'string', analyzer: 'string_analyzer'
query do
  match ['title'], search_terms, fuzziness: 0.5, max_expansions: 10, operator: 'and'
end
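For context, that query block should translate into roughly the following match query body (a sketch only; I'm showing search_terms as "foo bar" for illustration and assuming the options pass through unchanged):
{
  "query" : {
    "match" : {
      "title" : {
        "query" : "foo bar",
        "fuzziness" : 0.5,
        "max_expansions" : 10,
        "operator" : "and"
      }
    }
  }
}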
I have a record in my index with the title foo bar. If I search for foo bar, it appears in the results. However, if I search for foobar, it doesn't.
Can someone explain why, and if possible how I could get it to work? Can someone also explain how to get the reverse working, so that a record with the title foobar would show up when a user searches for foo bar?
Thanks
Upvotes: 4
Views: 826
Reputation: 17319
You can only search for tokens that are in your index. So let's look at what you are indexing.
You're currently using the lowercase tokenizer (which splits a string on non-letter characters and lowercases the resulting tokens), then applying the standard filter (redundant, because you are not using the standard tokenizer), followed by the stop and snowball filters.
If we create that analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "string_analyzer" : {
          "filter" : [
            "standard",
            "stop",
            "snowball"
          ],
          "tokenizer" : "lowercase"
        }
      }
    }
  }
}
'
and use the analyze API to test it out:
curl -XGET 'http://127.0.0.1:9200/test/_analyze?pretty=1&text=foo+bar&analyzer=string_analyzer'
you'll see that "foo bar" produces the terms ["foo","bar"] and "foobar" produces the single term ["foobar"]. So indexing "foo bar" and searching for "foobar" currently cannot work.
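For completeness, the analyze API returns one entry per token; the response for "foo bar" looks roughly like this (offsets and position numbering may vary slightly between Elasticsearch versions):
{
  "tokens" : [
    { "token" : "foo", "start_offset" : 0, "end_offset" : 3, "type" : "word", "position" : 1 },
    { "token" : "bar", "start_offset" : 4, "end_offset" : 7, "type" : "word", "position" : 2 }
  ]
}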
If you want to be able to search "inside" words, then you need to break words up into smaller tokens. To do this, we define a custom analyzer built around an ngram token filter.
So delete the test index:
curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'
and specify a new analyzer:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
  "settings" : {
    "analysis" : {
      "filter" : {
        "ngrams" : {
          "type" : "ngram",
          "min_gram" : 1,
          "max_gram" : 5
        }
      },
      "analyzer" : {
        "ngrams" : {
          "filter" : [
            "standard",
            "lowercase",
            "ngrams"
          ],
          "tokenizer" : "standard"
        }
      }
    }
  }
}
'
Now, if we test the analyzer, we get:
"foo bar" => [f,o,o,fo,oo,foo,b,a,r,ba,ar,bar]
"foobar" => [f,o,o,b,a,r,fo,oo,ob,ba,ar,foo,oob,oba,bar,foob,ooba,obar,fooba,oobar]
So if we index "foo bar" and we search for "foobar" using the match query, the query looks for any of those tokens, some of which exist in the index. Unfortunately, it will also overlap with "wear the fox hat" (which shares tokens such as f, o and a). While the foo bar document will appear higher up the list of results because it has more tokens in common with the query, you will still get apparently unrelated results.
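To see that overlap for yourself, you could index both titles and run a plain match query against them (a sketch; it assumes my_field has been mapped to use the ngrams analyzer, which I've left out here). Both documents should come back, with "foo bar" ranked first:
# index the two example documents
curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1' -d '{ "my_field" : "foo bar" }'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1' -d '{ "my_field" : "wear the fox hat" }'

# search for "foobar" with a plain match query
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : "foobar"
    }
  }
}
'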
This can be controlled by using the minimum_should_match parameter, e.g.:
curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1' -d '
{
  "query" : {
    "match" : {
      "my_field" : {
        "query" : "foobar",
        "minimum_should_match" : "60%"
      }
    }
  }
}
'
The exact value for minimum_should_match depends upon your data, so experiment with it.
Upvotes: 4