Reputation: 1767
I'm using Elasticsearch to search over a fairly broad range of documents, and I'm having trouble finding best practices for dealing with hyphenated words.
In my data, words frequently appear either hyphenated or as compound words, e.g. `pre-eclampsia` and `preeclampsia`. At the moment, searching for one won't find the other (the standard tokenizer indexes the hyphenated version as the two tokens `pre` and `eclampsia`).
This specific case could easily be fixed by stripping hyphens in a character filter. But often I do want to tokenize on hyphens: searches for `jean claude` and `happy go lucky` should match `jean-claude` and `happy-go-lucky`.
One approach is to solve this in the application layer, by essentially transforming any query for `hyphenated-word` into `hyphenated-word OR hyphenatedword`. But is there any way of dealing with all these use cases within the search engine, e.g. with some analyzer configuration? (Assume that my data is large and varied enough that I can't manually create exhaustive synonym files.)
Upvotes: 2
Views: 669
Reputation: 52862
You can use a compound word token filter: the `hyphenation_decompounder` should probably work well enough.
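A minimal filter definition might look like the sketch below (the filter name, word list, and patterns path are placeholders; the filter requires an XML hyphenation-patterns file, such as the FOP/OFFO ones referenced in the Elasticsearch docs):

```json
"filter": {
  "my_decompounder": {
    "type": "hyphenation_decompounder",
    "hyphenation_patterns_path": "analysis/en_US.xml",
    "word_list": ["eclampsia"]
  }
}
```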
It seems like your index contains many domain-specific words that aren't necessarily in a regular English dictionary, so I'd spend some time creating my own dictionary first, with the words that are important to your domain. This can be based on domain-specific literature, taxonomies, etc. The `dictionary_decompounder` is suitable for that.
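As a sketch (the index name, analyzer name, and the tiny word list are all made up), a `dictionary_decompounder` setup could look like:

```json
PUT /my-index
{
  "settings": {
    "analysis": {
      "filter": {
        "domain_decompounder": {
          "type": "dictionary_decompounder",
          "word_list": ["pre", "eclampsia"]
        }
      },
      "analyzer": {
        "compound_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "domain_decompounder"]
        }
      }
    }
  }
}
```

With this analyzer, `preeclampsia` is indexed both as-is and as the subtokens `pre` and `eclampsia`, so a query for `pre-eclampsia` (which the standard tokenizer splits into `pre` and `eclampsia`) can match it.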
This assumes your question is about Elasticsearch and not Solr, where the equivalent filter is named `DictionaryCompoundWordTokenFilter` instead.
Upvotes: 1