Mr Morgan

Reputation: 153

How to index word with hyphen in Lucene?

I have a StandardAnalyzer set up, and I retrieve words and their frequencies from a single document using a TermVectorMapper that populates a HashMap.

However, if I use the following text as a field in my document:

addDoc(w, "lucene Lawton-Browne Lucene");

The word frequencies returned in the HashMap are:

browne 1 lucene 2 lawton 1

The problem is that ‘lawton’ and ‘browne’ are returned as separate words. If this is an actual ‘double-barreled’ name, can Lucene recognise ‘Lawton-Browne’ as a single word?

I’ve tried combinations of:

addDoc(w, "lucene \"Lawton-Browne\" Lucene");

I have also tried single quotes, but without success.

Thanks

Mr Morgan.

Upvotes: 4

Views: 2616

Answers (2)

csupnig

Reputation: 3377

If you still want to be able to use a stop-word list, I suggest you try the PatternAnalyzer. It accepts such a list and comes with a predefined whitespace pattern.

Alternatively, wrap the whitespace analyzer in your own analyzer and apply a StopFilter in tokenStream(String fieldName, Reader reader), something like this:

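public TokenStream tokenStream(String fieldName, Reader reader) {
  // Split on whitespace only, so "Lawton-Browne" stays a single token
  TokenStream stream = myWhitespaceAnalyzer.tokenStream(fieldName, reader);
  // Then drop stop words from the whitespace-delimited tokens
  stream = new StopFilter(stream, stopWords);
  return stream;
}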
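
To see the effect of this approach without any Lucene dependency, here is a minimal plain-Java sketch of the same idea: split on whitespace only, drop stop words, and count what is left. It is an illustration, not Lucene's API. The class name WhitespaceTermCounter, the stop-word list, and the lowercasing step are all assumptions made for the example (a whitespace analyzer on its own does not lowercase; in Lucene you would typically add a LowerCaseFilter to the chain for that).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Lucene: mimics whitespace tokenizing,
// lowercasing and stop-word removal to show the resulting term counts.
class WhitespaceTermCounter {

    // Illustrative stop-word list (an assumption, not Lucene's default set).
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of"));

    static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        // Split on runs of whitespace only; hyphens are left untouched.
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // happens only when the input is blank
            }
            String term = token.toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(term)) {
                continue; // skip stop words
            }
            counts.merge(term, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {lucene=2, lawton-browne=1}
        System.out.println(countTerms("lucene Lawton-Browne Lucene"));
    }
}
```

Because the hyphen is never treated as a separator, the name comes out as one term instead of two.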

Upvotes: 1

Aaron Saunders

Reputation: 33345

Escape the special characters. See the Lucene documentation on query parser syntax:

http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Escaping%20Special%20Characters

Upvotes: 0
