adrianogf

Reputation: 171

Search for a specific term in a Lucene index

I'm trying to search a Lucene index for specific words that I know are indexed, but the results are poor.

How do I query for a specific term ("129202")? I tried adding a plus sign at the beginning of the string, but it did not work.

My query:

QueryParser q = new QueryParser(Version.LUCENE_42, "tags", new SimpleAnalyzer(Version.LUCENE_42));
Query query = q.parse("sapatilha feminina ramarim 129202 cinza");

Below is an indexed document (XML) that I want to retrieve:

<?xml version="1.0" encoding="UTF-8"?>
<product>
 <tags>
   <tag>Sapatilha Pedras Preto</tag>
   <tag>ramarin</tag>
   <tag>ramarin 129202</tag>
   <tag>preto</tag>
 </tags>
 <id>71</id>
 <url>http://www.dafiti.com.br/Sapatilha-Pedras-Preto-1135428.html</url>
</product>

Upvotes: 1

Views: 147

Answers (1)

femtoRgon

Reputation: 33341

SimpleAnalyzer, the analyzer you are using to query (and, I assume, to index), uses a LetterTokenizer, which, according to the documentation:

...defines tokens as maximal strings of adjacent letters, as defined by java.lang.Character.isLetter()

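Digits are not letters, so this analyzer discards numbers entirely: "129202" never becomes a token. I recommend switching to a different analyzer, such as StandardAnalyzer or WhitespaceAnalyzer. If the index was also built with SimpleAnalyzer, you will need to reindex with the new analyzer, since the numbers were dropped at index time too.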
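A minimal plain-Java sketch of the quoted rule, with no Lucene dependency, shows why the number disappears. The class and method names (LetterTokenSketch, letterTokens) are made up for illustration, and the loop only imitates "maximal strings of adjacent letters"; it is not Lucene's code, and it skips the lowercasing that SimpleAnalyzer also applies.

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenSketch {

    // Hypothetical helper: splits text into maximal runs of characters
    // for which Character.isLetter() returns true. Everything else
    // (digits, spaces, punctuation) only separates tokens and is dropped.
    public static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-letter ends the current run of letters.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The query string from the question: "129202" yields no token.
        System.out.println(letterTokens("sapatilha feminina ramarim 129202 cinza"));
        // One of the indexed tags: only "ramarin" survives.
        System.out.println(letterTokens("ramarin 129202"));
    }
}
```

Under this rule the tag "ramarin 129202" contributes only the token "ramarin", so no query term can match the number, with or without a plus sign.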


To demonstrate with the tokenizer itself:

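import java.io.StringReader;
import org.apache.lucene.analysis.core.LetterTokenizer;
import org.apache.lucene.util.Version;

// Print every token the tokenizer emits, with all of its attributes.
StringReader reader = new StringReader("ramarim 129202 cinza");
LetterTokenizer stream = new LetterTokenizer(Version.LUCENE_42, reader);
stream.reset();  // must be called before the first incrementToken()
while (stream.incrementToken()) {
    System.out.println(stream.reflectAsString(false));
}
stream.end();
stream.close();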

Outputs:

term=ramarim,bytes=[72 61 6d 61 72 69 6d],startOffset=0,endOffset=7
term=cinza,bytes=[63 69 6e 7a 61],startOffset=15,endOffset=20

Substituting in StandardTokenizer (which is used by StandardAnalyzer) will get you:

term=ramarim,bytes=[72 61 6d 61 72 69 6d],startOffset=0,endOffset=7,positionIncrement=1,type=<ALPHANUM>
term=129202,bytes=[31 32 39 32 30 32],startOffset=8,endOffset=14,positionIncrement=1,type=<NUM>
term=cinza,bytes=[63 69 6e 7a 61],startOffset=15,endOffset=20,positionIncrement=1,type=<ALPHANUM>

Upvotes: 1
