Robert Moskal

Reputation: 22553

What is the best synonym approach for Elasticsearch?

I'm working on implementing a synonym query for colors in a product catalog using Elasticsearch, and I've asked some consultants to implement it using the ES synonyms feature.

They tell me that a color might have hundreds of synonyms (white: ivory, creme, putty, etc.) and that we should do the mapping in our operational database. I am not convinced. Would there really be a huge performance hit if we had a list of, say, one hundred synonyms for white at query time? If that were slow, would doing the synonym mapping when indexing the documents obviate the problem?

The consultants want us to do the mapping in reverse, assigning a standard color to our items in our primary database and then passing that on to ES. I'd prefer not to have them learn anything about our architecture/infrastructure and just have them twiddle the knobs in ES, which they already know how to do.

Am I naive in thinking we can proceed in this way? Is decorating our operational database with standard colors really the way to go?

Upvotes: 3

Views: 1879

Answers (1)

Andrei Stefan

Reputation: 52366

The way I'd do it is to define a file of synonyms, as described in the documentation here, and maintain that file.

With that file I'd create a custom token filter and use it at indexing time. Doing this at query time probably wouldn't be a huge performance hit, but it's better to do it at indexing time: the response time at query time will be better.

Regarding your database, I don't know your architecture and I don't know why they say you need to put the synonyms there. As you see in the link I provided above, you can define a simple text file where you put something like:

ivory, creme, putty => white
...

This means that for any ivory, creme, putty found at indexing time, ES will actually index white and that's it.
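
To make the effect of that rule concrete, here is a rough Python sketch that mimics it outside of ES. It is only an illustration, not how ES implements the filter: `parse_rules` and `analyze` are made-up helper names, only single-word synonyms are handled, and matching is case-sensitive (as it would be with a whitespace tokenizer and no lowercase filter).

```python
def parse_rules(lines):
    """Parse 'a, b => c' contraction rules into a {variant: canonical} dict."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # skip blanks, comments, and anything that is not an 'a => b' rule
        left, right = line.split("=>", 1)
        canonical = right.strip()
        for variant in left.split(","):
            mapping[variant.strip()] = canonical
    return mapping


def analyze(text, mapping):
    """Split on whitespace, then replace every token that has a rule."""
    return [mapping.get(token, token) for token in text.split()]


rules = parse_rules(["ivory, creme, putty => white"])
print(analyze("ivory ceramic vase", rules))  # ['white', 'ceramic', 'vase']
```

Only the replaced token would end up in the index, so a search for "ivory" finds the document only if the query text goes through the same replacement.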

And the analyzer would look like this:

        "analyzer" : {
            "synonym" : {
                "tokenizer" : "whitespace",
                "filter" : ["synonym"]
            }
        },
        "filter" : {
            "synonym" : {
                "type" : "synonym",
                "synonyms_path" : "analysis/synonym.txt"
            }
        }
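
For context, the `analyzer` and `filter` objects above live under `settings.analysis` in the body you send when creating the index. A minimal sketch of that full body, built as a Python dict so it can be printed as JSON. Two assumptions here: the variable name `index_body` is made up, and the rule is given inline through the filter's `synonyms` parameter rather than `synonyms_path`, just to keep the example self-contained. Nothing is sent to a cluster.

```python
import json

# Hypothetical create-index body wrapping the analyzer/filter fragment.
# Assumption: the rule is listed inline via `synonyms` instead of being
# read from a file via `synonyms_path`.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "whitespace",
                    "filter": ["synonym"],
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms": ["ivory, creme, putty => white"],
                }
            },
        }
    }
}

# Print the JSON that would form the request body.
print(json.dumps(index_body, indent=2))
```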

But depending on what queries you want to run and what you need to match at query time, you can define an index_analyzer and a search_analyzer and use contraction or expansion. So, for the "right" solution, more variables need to be looked at than the ones you mentioned. In my approach above, I basically made all the synonyms of "white" equal at indexing time. But maybe you don't need this, given the queries you want to run.
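
The difference between contraction and expansion can be sketched with a toy in-memory "index" in Python. Everything here (the `docs` data, `contract`, `expand`, `search`) is hypothetical and only imitates the idea; it says nothing about real ES performance or scoring.

```python
SYNONYMS = {"white": ["ivory", "creme", "putty"]}  # canonical -> variants

# variant -> canonical, used for contraction
TO_CANONICAL = {v: c for c, vs in SYNONYMS.items() for v in vs}


def contract(tokens):
    """Contraction: collapse every variant to its canonical term."""
    return [TO_CANONICAL.get(t, t) for t in tokens]


def expand(tokens):
    """Expansion: replace a term with its whole synonym group."""
    out = []
    for t in tokens:
        canonical = TO_CANONICAL.get(t, t)
        out.extend([canonical] + SYNONYMS.get(canonical, []))
    return out


def identity(tokens):
    """No synonym handling at all."""
    return tokens


docs = {1: "ivory vase", 2: "putty sofa", 3: "red sofa"}  # made-up catalog


def search(query, index_filter, query_filter):
    """Return ids of docs sharing at least one analyzed token with the query."""
    wanted = set(query_filter(query.split()))
    return sorted(i for i, text in docs.items()
                  if wanted & set(index_filter(text.split())))


print(search("white", contract, contract))  # contraction on both sides
print(search("white", identity, expand))    # expansion at query time only
```

Both calls find documents 1 and 2 in this toy setup. The trade-off is where the work and the data live: contraction keeps a single term in the index, while query-time expansion leaves the index untouched but turns one query term into several.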

In conclusion:

  • I don't see why the colors need to be held in a database; they can very well be specified in a text file, as you saw above. Maybe I don't have all the details of your use case.
  • The final solution might involve analyzing the input text from the query itself or analyzing the text at indexing time, or both. This all depends on your specific use case and your queries.
  • Test the various solutions on real data and real volume and compare the performance you get.
  • Usually, synonyms are applied at indexing time, but not necessarily; it depends on the use case.

Upvotes: 8
