Felipe Hummel

Reputation: 4774

Indexing n-word expressions as a single term in Lucene

I want to index a "compound word" like "New York" as a single term in Lucene, not as "new" and "york". That way, if someone searches for "new place", documents containing "new york" won't match.

I don't think N-grams (i.e., NGramTokenizer) are the answer, because I don't want to index every n-gram, only some specific ones.

I've done some research and I know I should write my own Analyzer and maybe my own Tokenizer. But I'm a bit lost extending TokenStream/TokenFilter/Tokenizer.

Thanks

Upvotes: 4

Views: 534

Answers (2)

Jakub

Reputation: 409

I did it by creating a field that is indexed but not analyzed, using Field.Index.NOT_ANALYZED: doc.add(new Field("fieldName", "value", Field.Store.YES, Field.Index.NOT_ANALYZED, TermVector.YES)). This bypasses the StandardAnalyzer for that field, so the whole field value is stored as a single term and searches have to match it exactly.

This was on Lucene 3.0.2.

Upvotes: 0

Fred Foo

Reputation: 363487

I presume you have some way of detecting the multi-word units (MWUs) you want to preserve. What you can then do is replace the whitespace inside them with an underscore and use a WhitespaceAnalyzer instead of a StandardAnalyzer (which throws away punctuation), perhaps combined with a LowerCaseFilter.
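As a minimal sketch of that preprocessing step (the MWU list and the class name are assumptions for illustration; in practice the list would come from your own MWU detector or a dictionary):

```java
import java.util.Arrays;
import java.util.List;

public class MwuJoiner {
    // Hypothetical list of multi-word units to preserve as single tokens.
    private static final List<String> MWUS = Arrays.asList("new york", "los angeles");

    // Lowercase the text, then replace the whitespace inside each known MWU
    // with an underscore so a whitespace-based tokenizer later sees one token.
    public static String joinMwus(String text) {
        String result = text.toLowerCase();
        for (String mwu : MWUS) {
            result = result.replace(mwu, mwu.replace(' ', '_'));
        }
        return result;
    }

    public static void main(String[] args) {
        // "New York" becomes the single token "new_york"
        System.out.println(joinMwus("I moved to New York last year"));
    }
}
```

Run this over your documents (and over query strings) before handing them to the WhitespaceAnalyzer, and "new_york" is indexed as one term that a query for "new place" can't match.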

Writing your own Tokenizer requires quite some Lucene black magic. I've never been able to wrap my head around the Lucene 2.9+ APIs, but check out the TokenStream docs if you really want to try.
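If you do want to go the TokenFilter route, the filter's job boils down to look-ahead merging of adjacent tokens. Here is a Lucene-independent sketch of just that logic (the compound set and the underscore join are assumptions mirroring the trick above, not Lucene API calls):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BigramMerger {
    // Hypothetical set of two-word compounds to merge into single tokens.
    private static final Set<String> COMPOUNDS =
            new HashSet<>(Arrays.asList("new york", "san francisco"));

    // Scan the token stream; whenever two adjacent tokens form a known
    // compound, emit one merged token instead of two.
    public static List<String> merge(List<String> tokens) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < tokens.size()) {
            if (i + 1 < tokens.size()
                    && COMPOUNDS.contains(tokens.get(i) + " " + tokens.get(i + 1))) {
                out.add(tokens.get(i) + "_" + tokens.get(i + 1));
                i += 2;
            } else {
                out.add(tokens.get(i));
                i++;
            }
        }
        return out;
    }
}
```

A real Lucene TokenFilter would express the same one-token look-ahead through the attribute API (buffering the previous term and deciding whether to emit or merge it), which is where the black magic comes in.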

Upvotes: 1
