Reputation: 147
I'm trying to index some tags after stemming them and applying other filters. These tags could be composed of multiple words.
What I can't manage to do, though, is apply a final token filter that outputs a single token from the token stream.
So I would like tags made up of multiple words to be stemmed and have their stopwords removed, but then be joined back into a single token before being saved in the index (sort of what the keyword tokenizer does, but as a filter).
I can't find a way of doing this given how token filters are applied in Elasticsearch: if I tokenize on whitespace and then stem, all of the subsequent token filters receive the individual stemmed tokens, not the entire token stream, right?
For example I would like the tag
the fox jumps over the fence
to be saved in the index as a whole token as
fox jump over fence
and not
fox,jump,over,fence
Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?
Upvotes: 4
Views: 2064
Reputation: 1868
Providing an up-to-date answer in case someone comes across this looking for a solution. If your use case is aggregation, what the OP suggests they'd need to do:
"Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?"
is actually the best way to solve this problem, now that Elasticsearch uses the keyword and text types for mapping instead of just the string type and suggests using multi-fields (one keyword and one text) for aggregation use cases where you also need full-text search (https://www.elastic.co/guide/en/elasticsearch/reference/7.12/text.html#fielddata-mapping-param).
In modern versions of Elasticsearch, it'll even refuse to perform the aggregation on the text field unless fielddata is explicitly set to true in the mapping, warning you about the performance problem you're about to run into if you don't go with a multi-field instead.
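As a rough illustration of the multi-field approach, here is a minimal sketch that builds such a mapping as a Python dict and prints the JSON body you would send when creating the index. The field name tags and the sub-field name raw are made up for this example, and the choice of the built-in english analyzer is an assumption about what the OP wants from stemming and stopword removal.

```python
import json

# Hypothetical names: "tags" (the field) and "raw" (the keyword sub-field).
mapping = {
    "mappings": {
        "properties": {
            "tags": {
                # Analyzed for full-text search (stemming, stopword removal).
                "type": "text",
                "analyzer": "english",
                "fields": {
                    # Indexed as one unanalyzed token, suitable for aggregations.
                    "raw": {"type": "keyword"}
                },
            }
        }
    }
}

# Body for the index creation request.
print(json.dumps(mapping, indent=2))
```

With a mapping like this you would search on tags and aggregate on tags.raw. Note that the keyword sub-field holds the string exactly as it was sent, so getting a value like "fox jump over fence" there still means normalizing the tag before it is indexed.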
Modern versions of Elasticsearch also provide facilities for preprocessing your data into multiple fields within the cluster if it's a pain to do it before it's indexed (https://www.elastic.co/guide/en/elasticsearch/reference/7.12/ingest.html).
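To give an idea of what in-cluster preprocessing could look like, here is a small sketch of an ingest pipeline definition, again built as a Python dict and printed as JSON. The field names tags and tags_normalized are hypothetical, and the sketch assumes tags holds a single string. It only copies the value, lowercases it and collapses whitespace; these three processors do not stem, so this is not a full replacement for the analysis chain in the question.

```python
import json

# Hypothetical field names: "tags" (source) and "tags_normalized" (target).
pipeline = {
    "description": "Copy tags into a normalized field (illustrative only)",
    "processors": [
        # Copy the original value into a new field via a Mustache template.
        {"set": {"field": "tags_normalized", "value": "{{tags}}"}},
        # Lowercase the copy.
        {"lowercase": {"field": "tags_normalized"}},
        # Collapse runs of whitespace into a single space.
        {"gsub": {"field": "tags_normalized", "pattern": "\\s+", "replacement": " "}},
    ],
}

# Body for the request that creates or updates the pipeline.
print(json.dumps(pipeline, indent=2))
```

The normalized field could then be mapped as keyword and used for aggregations, while the original field stays as text.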
Upvotes: 0
Reputation: 147
After a bit of research I found this thread:
http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html
which had the exact solution I was looking for.
I created a simple Elasticsearch plugin that only provides the Concatenate Token Filter, which you can find at:
https://github.com/francesconero/elasticsearch-concatenate-token-filter
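For readers who just want to see the intended behaviour, here is a plain Python simulation of "tokenize, filter, then concatenate". It is not the plugin's code and does not use any Elasticsearch API; the stopword list and the toy_stem function are stand-ins invented for this example, not a real stemming algorithm.

```python
# Stand-in stopword list for this example only.
STOPWORDS = {"the", "a", "an"}


def toy_stem(token: str) -> str:
    """Toy stemmer: strips a single trailing 's'. Not a real stemming algorithm."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze_and_concatenate(tag: str, separator: str = " ") -> str:
    """Tokenize on whitespace, drop stopwords, stem, then join into one token."""
    tokens = tag.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [toy_stem(t) for t in tokens]
    # The concatenation step: emit one token instead of many.
    return separator.join(tokens)


print(analyze_and_concatenate("the fox jumps over the fence"))  # fox jump over fence
```

The same function could also serve as the application-side preprocessing step mentioned in the question, with the result indexed into an unanalyzed field.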
Upvotes: 2