Reputation: 22553
I'm working on implementing a synonym query for colors in a product catalog using Elasticsearch, and I've asked some consultants to implement it using the ES synonyms feature.
They tell me that a color might have hundreds of synonyms (white: ivory, creme, putty, etc.) and that we should do the mapping in our operational database. I am not convinced. Would there really be a huge performance hit if we had a list of, say, one hundred synonyms for white at query time? If that were slow, would doing the synonym mapping when indexing the documents obviate the problem?
The consultants want us to do the mapping in reverse, assigning a standard color to our items in our primary database and then passing that on to ES. I'd prefer not to have them learn anything about our architecture/infrastructure and just have them twiddle the knobs in ES, which they already know how to do.
Am I naive in thinking we can proceed in this way? Is decorating our operational database with standard colors really the way to go?
Upvotes: 3
Views: 1879
Reputation: 52366
The way I'd do it is to define a file of synonyms, as described in the documentation here, and maintain that file.
With that file I'd create a custom token filter and use it at indexing time. Doing this at query time would probably not be a huge performance hit, but it's better to do it at indexing time, because the response time at query time will be better.
Regarding your database, I don't know your architecture and I don't know why they say you need to put the synonyms there. As you see in the link I provided above, you can define a simple text file where you put something like:
ivory, creme, putty => white
...
This means that for any ivory, creme, or putty found at indexing time, ES will actually index white and that's it.
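To make the effect of that rule concrete, here is a minimal Python sketch that imitates what an explicit mapping does to a token stream. It is only a simulation for illustration, not Elasticsearch code; the helper names parse_rules and contract are hypothetical.

```python
def parse_rules(text):
    """Parse explicit 'a, b => c' rules into a {source: target} lookup."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # skip blanks, comments, and non-explicit rules
        left, right = line.split("=>", 1)
        target = right.strip()
        for source in left.split(","):
            mapping[source.strip()] = target
    return mapping


def contract(tokens, mapping):
    """Replace each token with its mapped target; other tokens pass through."""
    return [mapping.get(token, token) for token in tokens]


rules = parse_rules("ivory, creme, putty => white")
tokens = "putty leather sofa".split()  # split on whitespace, like the whitespace tokenizer
print(contract(tokens, rules))  # ['white', 'leather', 'sofa']
```

Only white ends up in the indexed terms, so a hundred synonyms cost a lookup per token at indexing time rather than a hundred extra terms per query.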
And the analyzer would look like this:
    "analyzer" : {
        "synonym" : {
            "tokenizer" : "whitespace",
            "filter" : ["synonym"]
        }
    },
    "filter" : {
        "synonym" : {
            "type" : "synonym",
            "synonyms_path" : "analysis/synonym.txt"
        }
    }
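For context, the fragment above sits under the analysis section of the index settings. The sketch below builds a complete settings body as a Python dict and prints it as JSON. Several things here are my assumptions rather than part of the answer: the names color_synonyms and color_analyzer are hypothetical, the rules are given inline through the synonyms parameter instead of synonyms_path, and a lowercase filter is added because the whitespace tokenizer does not lowercase on its own.

```python
import json

# Hypothetical settings body; "color_synonyms" and "color_analyzer" are
# illustrative names, not anything prescribed by Elasticsearch.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "color_synonyms": {
                    "type": "synonym",
                    # Inline rules; use "synonyms_path" instead to load a file.
                    "synonyms": ["ivory, creme, putty => white"],
                }
            },
            "analyzer": {
                "color_analyzer": {
                    "tokenizer": "whitespace",
                    # Assumption: lowercase first so "Ivory" also matches the rule.
                    "filter": ["lowercase", "color_synonyms"],
                }
            },
        }
    }
}

print(json.dumps(index_body, indent=2))
```

The printed JSON is the kind of body you would send when creating the index. With a long, frequently edited list, the file-based synonyms_path variant from the answer is probably easier to maintain.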
But depending on what queries you want to run and what you need to match at query time, you can define an index_analyzer and a search_analyzer, and use contraction or expansion. So, for the "right" solution, more variables need to be looked at, not only the ones you mentioned.
In my approach above, I basically made all the synonyms of "white" equal at indexing time. But maybe you don't need this, given the queries you want to run.
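The difference between contraction and expansion can be sketched in plain Python. This is a toy model of the two strategies, not how Elasticsearch implements them; GROUP, contract and expand are hypothetical names.

```python
# One synonym group; the first entry is treated as the canonical term.
GROUP = ["white", "ivory", "creme", "putty"]


def contract(tokens, group=GROUP):
    """Contraction: map every member of the group to its first term."""
    return [group[0] if token in group else token for token in tokens]


def expand(tokens, group=GROUP):
    """Expansion: replace any member of the group with all members."""
    out = []
    for token in tokens:
        out.extend(group if token in group else [token])
    return out


doc = "ivory sofa".split()
query = "white sofa".split()

# Contraction on both sides: document and query meet on the single term "white".
assert set(contract(query)) <= set(contract(doc))

# Expansion at query time only: the raw document term "ivory" is one of
# the expanded query terms, at the cost of a longer query.
assert "ivory" in expand(query)

print(contract(doc), expand(query))
```

One consequence of contracting at indexing time: the index only contains white, so a search for ivory has to go through the same synonym filter to match. That is what happens by default when the field uses one analyzer for both indexing and searching. Note also that newer Elasticsearch versions spell the index-time setting analyzer rather than index_analyzer.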
In conclusion:
Upvotes: 8