Reputation: 989
I have a word like lovelive, which combines the two simple words love and live without whitespace.
Which kind of Lucene Analyzer can tokenize this kind of word into its two separate words?
Upvotes: 4
Views: 1561
Reputation: 9500
Have a look at the DictionaryCompoundWordTokenFilter, as described in the Solr reference:
This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.
In: "Donaudampfschiff dummkopf"
Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2),
Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)
As you can see in the sample configuration, you will need a dictionary for the language you want to split. In the sample this is a file germanwords.txt, which lists the component words that compounds are decomposed into. In your case it would contain love and live (see the sample dictionary after the configuration below).
<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
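For the word in the question, the dictionary file (named germanwords.txt in the sample; the name itself is arbitrary) would simply list the component words, one per line:

love
live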
For Lucene itself, the class is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The source code can be found on GitHub.
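To use the filter directly in Lucene, without Solr, you can wrap a tokenizer with it yourself. The following is a minimal sketch, assuming a recent Lucene version (5.x or later, where CharArraySet lives in org.apache.lucene.analysis and tokenizers are constructed without a Reader); in older 4.x releases the import paths and constructors differ slightly.

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class DecompoundDemo {
    public static void main(String[] args) throws Exception {
        // The dictionary of component words, equivalent to germanwords.txt above.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("love", "live"), true);

        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("lovelive"));

        // Wrap the tokenizer; the default limits apply (minWordSize=5,
        // minSubwordSize=2, maxSubwordSize=15), which cover the 8-character
        // input "lovelive" and its two 4-character subwords.
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);

        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // prints: lovelive, love, live
        }
        stream.end();
        stream.close();
    }
}

As in the Donaudampfschiff example above, the original token lovelive is kept in the stream, and the subwords love and live are added at the same logical position.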
Upvotes: 4