Cao Dongping

Reputation: 989

How to tokenize a word that is a combination of two words without whitespace

I have a word like lovelive, which is a combination of the two simple words love and live without whitespace.

I want to know which kind of Lucene Analyzer can tokenize this kind of word into two separate words.

Upvotes: 4

Views: 1561

Answers (1)

cheffe

Reputation: 9500

Have a look at the DictionaryCompoundWordTokenFilter as described in the Solr reference:

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.

In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)

As you can see in the sample configuration, you will need a dictionary for the language you want to split. In the sample they use a germanwords.txt that contains the component words compounds should be decomposed into, if found composed. In your case this would be love and live.

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
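
Adapted to the question, the dictionary file would simply list the component words, one per line, here love and live. A minimal sketch, using a hypothetical file name compoundwords.txt:

compoundwords.txt:
love
live

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="compoundwords.txt"/>
</analyzer>

With the filter's default settings (minimum word size 5, minimum subword size 2), the token lovelive should be emitted unchanged and additionally decompounded into love and live.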

For Lucene, it is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The code can be found on GitHub.
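
If you want to use the filter directly through the Lucene API rather than via a Solr schema, a minimal sketch could look like the following. It assumes a recent Lucene version (package names and constructors have shifted slightly between releases); the class name, field name and the inline dictionary are made up for this example.

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CompoundSplitDemo {

    public static void main(String[] args) throws Exception {
        // Dictionary of component words; for the question that is "love" and "live".
        // In a real setup this would typically be loaded from a word-list file.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("love", "live"), true);

        // Analyzer that tokenizes with StandardTokenizer and then decompounds
        // tokens against the dictionary; the original token is kept and the
        // subwords are added at the same logical position.
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                TokenStream result = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
                return new TokenStreamComponents(tokenizer, result);
            }
        };

        // Print every token the analyzer produces for "lovelive".
        try (TokenStream stream = analyzer.tokenStream("someField", "lovelive")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
        analyzer.close();
    }
}

Running this on lovelive should print the original token followed by love and live, mirroring the Donaudampfschiff example above.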

Upvotes: 4
