Reputation: 4774
I want to index a "compound word" like "New York" as a single term in Lucene, not as the two terms "new" and "york", so that if someone searches for "new place", documents containing "new york" won't match.
I don't think this is a case for N-grams (that is, NGramTokenizer), because I don't want to index every n-gram, only some specific ones.
I've done some research and I know I should write my own Analyzer and maybe my own Tokenizer. But I'm a bit lost when it comes to extending TokenStream/TokenFilter/Tokenizer.
Thanks
Upvotes: 4
Views: 534
Reputation: 409
I did it by creating a field that is indexed but not analyzed, so the whole field value is indexed as a single term. For this I used Field.Index.NOT_ANALYZED:
> doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, TermVector.YES));
The analyzer was the StandardAnalyzer.
I was working with Lucene 3.0.2.
Upvotes: 0
Reputation: 363487
I presume you have some way of detecting the multi-word units (MWUs) that you want to preserve. Then what you can do is replace the whitespace in them with an underscore and use a WhitespaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps with a LowerCaseFilter.
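A minimal sketch of the underscore preprocessing step, in plain Java with no Lucene dependency. The class name `MwuJoiner`, the methods `joinMwus` and `tokenize`, and the fixed MWU list are all hypothetical and only for illustration. The `tokenize` method merely mimics what whitespace splitting plus lowercasing would produce, it is not Lucene's analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MwuJoiner {
    // Hypothetical list of multi-word units; in practice this would come
    // from whatever MWU detection you already have.
    static final List<String> MWUS = Arrays.asList("New York", "San Francisco");

    // Replace the whitespace inside each known MWU with an underscore,
    // so "New York" becomes "New_York" before the text is analyzed.
    static String joinMwus(String text, List<String> mwus) {
        String result = text;
        for (String mwu : mwus) {
            // \b keeps "New Yorker" from matching; quote() escapes regex characters.
            Pattern p = Pattern.compile("\\b" + Pattern.quote(mwu) + "\\b",
                                        Pattern.CASE_INSENSITIVE);
            Matcher m = p.matcher(result);
            StringBuffer sb = new StringBuffer();
            while (m.find()) {
                String joined = m.group().replace(' ', '_');
                m.appendReplacement(sb, Matcher.quoteReplacement(joined));
            }
            m.appendTail(sb);
            result = sb.toString();
        }
        return result;
    }

    // Stand-in for whitespace tokenization followed by lowercasing.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = joinMwus("I moved to New York last year", MWUS);
        System.out.println(tokenize(doc));

        String query = joinMwus("new place", MWUS);
        System.out.println(tokenize(query));
    }
}
```

The same `joinMwus` step would have to be applied to query text as well, otherwise a search for "new york" would produce the terms "new" and "york" and never hit the indexed term "new_york".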
Writing your own Tokenizer requires quite some Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
Upvotes: 1