Reputation: 350
I have the following synonym expansion:
suco => suco, refresco, bebida de soja
What I want is for the search to be tokenized this way:
A search for "suco de laranja" would be tokenized to ["suco", "laranja", "refresco", "bebida de soja"].
But it is being tokenized to ["suco", "laranja", "refresco", "bebida", "soja"].
Note that "de" is a stop word. I want it to be ignored in the query, so that "bebida de laranja" becomes ["bebida", "laranja"]. But I don't want it to be removed during synonym tokenization, so that "bebida de soja" stays as the single token "bebida de soja".
My settings:
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco => suco, refresco, bebida de soja"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "synonym_br",
            "lowercase",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
Upvotes: 2
Views: 1776
Reputation: 7874
I would suggest making the following two changes. The first relates directly to the question you asked; the second is a suggestion.
First, instead of expanding one word into multiple synonyms, do the opposite, i.e. make all the synonyms point to a single word. So, change "suco => suco, refresco, bebida de soja" to "suco, refresco, bebida de soja => suco".
Second, change the order of the filters in the synonyms analyzer: place lowercase before synonym_br. This ensures that case doesn't affect the synonym_br token filter.
So the final settings will be:
{
  "settings": {
    "analysis": {
      "filter": {
        "synonym_br": {
          "type": "synonym",
          "synonyms": [
            "suco, refresco, bebida de soja => suco"
          ]
        },
        "brazilian_stop": {
          "type": "stop",
          "stopwords": "_brazilian_"
        }
      },
      "analyzer": {
        "synonyms": {
          "filter": [
            "lowercase",
            "synonym_br",
            "brazilian_stop",
            "asciifolding"
          ],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  }
}
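One way to check what the analyzer actually emits is Elasticsearch's _analyze API. The sketch below only builds and sends that request from Python's standard library; the host http://localhost:9200 and the index name produtos are assumptions for illustration, not something taken from the question, and the index is assumed to have been created with the settings above.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: a local, unsecured node
INDEX = "produtos"                # hypothetical index created with the settings above


def build_analyze_request(text, analyzer="synonyms"):
    """Build a POST /<index>/_analyze request for the custom analyzer."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{ES_URL}/{INDEX}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze_with_es(text):
    """Send the request and return only the token strings from the response."""
    with urllib.request.urlopen(build_analyze_request(text)) as resp:
        return [t["token"] for t in json.load(resp)["tokens"]]
```

With a node running, calling analyze_with_es("bebida de soja") returns the tokens the synonyms analyzer produces for that text, which can be compared against the tables in this answer.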
For the input bebida de soja, the filters apply in the following order:
Input filter      Result tokens
====================================
lowercase         bebida, de, soja
synonym_br        suco               <------- the tokens above (including their positions) exactly match a synonym
brazilian_stop    suco
asciifolding      suco
Let's see brazilian_stop in action. For this we need an input that doesn't match any synonym but has de in it, e.g. de soja:
Input filter      Result tokens
=================================
lowercase         de, soja
synonym_br        de, soja           <------- none of the tokens (individually or combined, including their positions) match any synonym
brazilian_stop    soja               <------- de is removed because it is a stopword
asciifolding      soja
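To make the ordering argument concrete, here is a toy Python simulation of the four filters. It is not Elasticsearch code: the whitespace split stands in for the standard tokenizer, the stopword set contains only "de" rather than the full _brazilian_ list, and the synonym matcher is a simplified longest-match lookup written for this illustration.

```python
import unicodedata

# Toy stand-ins for the real filters (assumptions, not Elasticsearch internals).
SYNONYMS = {
    ("suco",): "suco",
    ("refresco",): "suco",
    ("bebida", "de", "soja"): "suco",
}
STOPWORDS = {"de"}  # tiny subset standing in for _brazilian_


def lowercase(tokens):
    return [t.lower() for t in tokens]


def synonym_br(tokens):
    """Replace the longest matching run of tokens with its target word."""
    out, i = [], 0
    max_len = max(len(key) for key in SYNONYMS)
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in SYNONYMS:
                out.append(SYNONYMS[key])
                i += n
                break
        else:  # no synonym starts at this position, keep the token
            out.append(tokens[i])
            i += 1
    return out


def brazilian_stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def asciifolding(tokens):
    """Strip combining accents, a rough approximation of ASCII folding."""
    return [
        "".join(c for c in unicodedata.normalize("NFKD", t)
                if not unicodedata.combining(c))
        for t in tokens
    ]


def simulate(text):
    tokens = text.split()  # crude stand-in for the standard tokenizer
    for token_filter in (lowercase, synonym_br, brazilian_stop, asciifolding):
        tokens = token_filter(tokens)
    return tokens


print(simulate("bebida de soja"))   # ['suco']
print(simulate("de soja"))          # ['soja']
print(simulate("Suco de laranja"))  # ['suco', 'laranja']
```

Because synonym_br runs before brazilian_stop, the de inside bebida de soja is still present when the multi-word synonym is matched, while a stray de elsewhere in the query is dropped afterwards.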
Upvotes: 2