Francesco

Reputation: 147

Merge token filter in Elasticsearch

I'm trying to index some tags after stemming them and applying other filters. These tags could be composed of multiple words.

What I can't manage to do, though, is apply a final token filter that merges the whole token stream into a single token.

So I would like multi-word tags to be stemmed and have their stopwords removed, but then be joined back into a single token before being saved in the index (sort of what the keyword tokenizer does, but as a filter).

I can't find a way to do this given how token filters are applied in Elasticsearch: if I tokenize on whitespace and then stem, every subsequent token filter receives the stemmed tokens one at a time rather than the entire token stream, right?

For example I would like the tag

the fox jumps over the fence

to be saved in the index as a whole token as

fox jump over fence

and not

fox,jump,over,fence

Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?

Upvotes: 4

Views: 2064

Answers (2)

Matt Welke

Reputation: 1868

Providing an up-to-date answer in case someone comes across this looking for a solution. If your use case is aggregation, what the OP suggests they'd need to do:

"Is there any way of doing this without preprocessing the string in my application and then indexing it as a not_analyzed field?"

is actually the best way to solve this problem now. Elasticsearch has replaced the string type with the keyword and text mapping types (keyword being the counterpart of a not_analyzed string), and it suggests multi fields (one keyword and one text) for use cases that need both aggregation and full text search (https://www.elastic.co/guide/en/elasticsearch/reference/7.12/text.html#fielddata-mapping-param).

Modern versions of Elasticsearch will even refuse to perform the aggregation on the text field unless fielddata is explicitly set to true in the mapping, warning you about the performance problem (fielddata can use a lot of heap memory) you're about to run into if you don't go with a multi field instead.
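As a rough sketch of the multi field approach, the request bodies below are built as plain Python dicts. The field name "tag" and the sub-field name "raw" are just illustrative choices, not anything from the question. The first body would be sent when creating the index, the second as a search request that aggregates on the keyword sub-field.

```python
import json

# Assumption: the field is called "tag"; "raw" is an arbitrary sub-field name.
# The text field supports full text search, the keyword sub-field keeps the
# whole value as one term for aggregations.
index_body = {
    "mappings": {
        "properties": {
            "tag": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Terms aggregation on the keyword sub-field instead of the text field,
# so fielddata does not need to be enabled.
terms_query = {
    "size": 0,
    "aggs": {"tags": {"terms": {"field": "tag.raw"}}},
}

print(json.dumps(index_body, indent=2))
print(json.dumps(terms_query, indent=2))
```

Note that both fields of a multi field are fed the same source value, so a preprocessed (stemmed and re-joined) tag would still have to be computed before indexing and sent as the field value.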

Modern versions of Elasticsearch also provide ingest pipelines for preprocessing your data into multiple fields within the cluster, if it's a pain to do it in your application before indexing (https://www.elastic.co/guide/en/elasticsearch/reference/7.12/ingest.html).
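For what it's worth, a minimal sketch of such a pipeline definition could look like the dict below, which would be the body of a PUT to _ingest/pipeline/ followed by a pipeline name of your choosing. The field names "tag" and "tag_normalized" are made up for the example. It only copies, lowercases and trims the value; it does not stem or remove stopwords, so treat it as a starting point rather than a full replacement for the analysis chain in the question.

```python
import json

# Assumption: documents carry the original tag in a field called "tag";
# "tag_normalized" is a hypothetical target field (it could be mapped as keyword).
pipeline_body = {
    "description": "Copy the tag into a separate field and normalise it",
    "processors": [
        # copy_from on the set processor needs Elasticsearch 7.11 or later
        {"set": {"field": "tag_normalized", "copy_from": "tag"}},
        {"lowercase": {"field": "tag_normalized"}},
        {"trim": {"field": "tag_normalized"}},
    ],
}

print(json.dumps(pipeline_body, indent=2))
```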

Upvotes: 0

Francesco

Reputation: 147

After a bit of research I found this thread:

http://elasticsearch-users.115913.n3.nabble.com/Is-there-a-concatenation-filter-td3711094.html

which had the exact solution I was looking for. 

I created a simple Elasticsearch plugin that only provides the Concatenate Token Filter, which you can find at:

https://github.com/francesconero/elasticsearch-concatenate-token-filter
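To make the idea concrete, here is a small plain-Python simulation of what such an analysis chain does conceptually: split on whitespace, drop stopwords, stem, then concatenate everything back into one token. This is not the plugin's code. The stopword list is a tiny subset and toy_stem is a deliberately naive stand-in for a real stemmer, both chosen only so the example from the question works.

```python
# Tiny illustrative stopword list (an assumption, not Elasticsearch's full list).
STOPWORDS = {"a", "an", "the"}


def toy_stem(token):
    """Naive stand-in for a stemmer: strips a plural/3rd-person trailing 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze_to_single_token(text, separator=" "):
    """Whitespace-tokenize, remove stopwords, stem, then join into one token."""
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    tokens = [toy_stem(t) for t in tokens]
    # The concatenation step: the whole stream collapses into a single token.
    return separator.join(tokens)


print(analyze_to_single_token("the fox jumps over the fence"))  # fox jump over fence
```

The same function could also serve as the application-side preprocessing step if you would rather index the result into a keyword field than install a plugin.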

Upvotes: 2
