Yossi Vainshtein

Reputation: 3995

Lucene SpanQuery matching the same word

I have a system that uses Lucene to search documents according to queries given by the user. When the user's query contains more than one word, I create a SpanNearQuery with one clause per word, where the last clause wraps a prefix query (and slop = 0). For example, if the user input is "new y", this should match both "new year" and "new york".

This works fine. However, if the query contains the same word twice, e.g. "bora bora", even documents with a single occurrence of "bora" are matched.

How can I match only "bora bora*"?

Code:

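import org.apache.lucene.index.Term;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.spans.SpanMultiTermQueryWrapper;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

String[] words = querystr.split(" ");
SpanQuery[] clausesWildCard = new SpanQuery[words.length];
for (int i = 0; i < words.length; i++) {
    if (allWordsPrefix || i == words.length - 1) {
        // Last word (or every word, if requested) is matched as a prefix.
        PrefixQuery pq = new PrefixQuery(new Term(LOWER_VALUE, words[i]));
        clausesWildCard[i] = new SpanMultiTermQueryWrapper<PrefixQuery>(pq);
    } else {
        // All other words must match exactly.
        Term clause = new Term(LOWER_VALUE, words[i]);
        clausesWildCard[i] = new SpanTermQuery(clause);
    }
}
// slop = 0, inOrder = false
SpanQuery allTheWords = new SpanNearQuery(clausesWildCard, 0, false);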

EDIT: This seems to be a known issue: https://issues.apache.org/jira/browse/LUCENE-5932 and https://issues.apache.org/jira/browse/LUCENE-3120

However, I don't understand whether it has been fixed or has a workaround.

I upgraded to Lucene 5.0.0, but it keeps happening.

Upvotes: 1

Views: 219

Answers (1)

Necreaux

Reputation: 9786

Are you using the shingle filter when building your index? That was the solution I came up with. In a nutshell, every pair of consecutive words has to be indexed as well. For example (ignoring stop words), for "The quick brown fox jumps over the lazy dog", in addition to each individual word, the index would also contain "The quick", "quick brown", "brown fox", "fox jumps", etc.
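To illustrate the idea, here is a minimal plain-Java sketch of what "indexing every consecutive word pair" means. It does not use the Lucene API (in Lucene this job would be done by the shingle filter in the analysis chain); the class name ShingleSketch and the method wordPairs are hypothetical names made up for this example, and the tokenization (lowercase, split on whitespace) is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

public class ShingleSketch {
    // Hypothetical helper: returns every pair of consecutive words,
    // joined by a single space (i.e. shingles of size 2).
    static List<String> wordPairs(String text) {
        // Assumed tokenization: lowercase, split on whitespace.
        String[] words = text.trim().toLowerCase().split("\\s+");
        List<String> pairs = new ArrayList<>();
        for (int i = 0; i + 1 < words.length; i++) {
            pairs.add(words[i] + " " + words[i + 1]);
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Contains the pair "bora bora".
        System.out.println(wordPairs("bora bora island"));
        // A single "bora" never produces the pair "bora bora".
        System.out.println(wordPairs("bora island"));
    }
}
```

With pairs like these indexed as single terms, a query for "bora bora" can be looked up as one term, so a document containing "bora" only once would not have it.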

Perhaps someone else might have a better solution.

Upvotes: 0
