Cao Dongping

Reputation: 989

How to tokenize a word that is a combination of two words without whitespace

I have a word like lovelive, which is a combination of the two simple words love and live without whitespace.

I want to know which kind of Lucene Analyzer can tokenize this kind of word into two separate words.

Upvotes: 4

Views: 1561

Answers (1)

cheffe

Reputation: 9500

Have a look at the DictionaryCompoundWordTokenFilter as described in the Solr reference:

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.

In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)

As you can see in the sample configuration, you will need a dictionary for the language you want to split. In the sample they use a germanwords.txt that contains the component words compounds should be decomposed into, if found composed. In your case this would be love and live.

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
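
Adapted to the question, the dictionary file would simply list the component words, one per line, here love and live. A minimal sketch, using a hypothetical file name compoundwords.txt:

compoundwords.txt:
love
live

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="compoundwords.txt"/>
</analyzer>

With the filter's default settings (minimum word size 5, minimum subword size 2), the token lovelive should be emitted unchanged and additionally decompounded into love and live.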

For Lucene, it is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The code can be found on GitHub.
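
If you want to use the filter directly through the Lucene API rather than via a Solr schema, a minimal sketch could look like the following. It assumes a recent Lucene version (package names and constructors have shifted slightly between releases); the class name, field name and the inline dictionary are made up for this example.

import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CompoundSplitDemo {

    public static void main(String[] args) throws Exception {
        // Dictionary of component words; for the question that is "love" and "live".
        // In a real setup this would typically be loaded from a word-list file.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("love", "live"), true);

        // Analyzer that tokenizes with StandardTokenizer and then decompounds
        // tokens against the dictionary; the original token is kept and the
        // subwords are added at the same logical position.
        Analyzer analyzer = new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName) {
                Tokenizer tokenizer = new StandardTokenizer();
                TokenStream result = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
                return new TokenStreamComponents(tokenizer, result);
            }
        };

        // Print every token the analyzer produces for "lovelive".
        try (TokenStream stream = analyzer.tokenStream("someField", "lovelive")) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
        analyzer.close();
    }
}

Running this on lovelive should print the original token followed by love and live, mirroring the Donaudampfschiff example above.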

Upvotes: 4
