Martinffx

Reputation: 2476

Searching for Terms with whitespace using Lucene

I'm trying to use Lucene to add a search feature, but I can't get an index to work with significant whitespace. I've set up the following test case:

RAMDirectory directory = new RAMDirectory();
KeywordAnalyzer analyzer = new KeywordAnalyzer();
IndexWriterConfig config = new IndexWriterConfig(analyzer);
IndexWriter writer = new IndexWriter(directory, config);
Document doc = new Document();
doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
writer.addDocument(doc);
writer.close();

IndexReader reader = DirectoryReader.open(directory);
IndexSearcher searcher = new IndexSearcher(reader);

QueryParser parser = new QueryParser("content", analyzer);
parser.setSplitOnWhitespace(false);
Query query = parser.parse("Bill E");

TopDocs docs = searcher.search(query, 1);
assertTrue(docs.totalHits > 0);

I'm using Lucene 6.6.0 and from what I understand the KeywordAnalyzer is what I'm looking for:

"Tokenizes" the entire stream as a single token. This is useful for data like zip codes, ids, and some product names.

But I can't seem to get any matching documents that contain whitespace.

Any ideas on how to solve this?

Upvotes: 2

Views: 2786

Answers (1)

Sabir Khan

Reputation: 10142

When you index, you end up with a single document with a single field containing a single term whose value is "Bill Evans".

When you search, the TermQuery produced by QueryParser looks for the term value "Bill E". That term doesn't exist in the index, so you get zero hits.

If you replace your search string with "Bill Evans", you will get results.

Please refer to this question too.

First, separate your indexing and searching concerns: you can only search what is indexed. If you index the full text without breaking it into tokens, then at search time you need to produce a WildcardQuery, FuzzyQuery, PhraseQuery, etc. whenever the input string differs from what is indexed. A TermQuery only matches exact term values.
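As a rough sketch of that idea, the single-term index from the question can be matched by building a query by hand instead of going through QueryParser. This uses PrefixQuery, a close relative of the wildcard query named above, and assumes Lucene 6.6.0 with the lucene-analyzers-common module on the classpath. The class name PrefixSearchSketch and the method countHits are made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical helper class for illustration; not part of Lucene.
public class PrefixSearchSketch {

    // Indexes "Bill Evans" as one term, then counts hits for terms starting with the prefix.
    public static int countHits(String prefix) throws IOException {
        RAMDirectory directory = new RAMDirectory();
        IndexWriterConfig config = new IndexWriterConfig(new KeywordAnalyzer());
        try (IndexWriter writer = new IndexWriter(directory, config)) {
            Document doc = new Document();
            doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // The Term is built directly, so no analyzer touches the input:
            // whitespace is kept and the match is case-sensitive.
            Query query = new PrefixQuery(new Term("content", prefix));
            return searcher.search(query, 1).scoreDocs.length;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countHits("Bill E")); // prints 1
        System.out.println(countHits("bill e")); // prints 0
    }
}
```

Because KeywordAnalyzer does not lowercase, the indexed term is exactly "Bill Evans", so the prefix has to match its case as well as its whitespace.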

My suggestion would be to store the full text value untokenized (a StringField would do that) and also generate additional tokens by splitting the text with something like SimpleAnalyzer, which breaks at non-letter characters and lowercases each token.

Something like:

doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));

With the code above and SimpleAnalyzer, the index now contains the terms bill and evans (as well as the full text as a stored field). If you search with the same analyzer, your query becomes content:bill content:e. Both clauses are optional by default, so the match on bill is enough to get a result.
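Put together, that suggestion might look like the sketch below. It assumes Lucene 6.6.0 with the lucene-analyzers-common and lucene-queryparser modules on the classpath. The class name TokenizedSearchSketch and the method countHits are made up for illustration.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.RAMDirectory;

// Hypothetical helper class for illustration; not part of Lucene.
public class TokenizedSearchSketch {

    // Indexes "Bill Evans" with SimpleAnalyzer and counts hits for the parsed query text.
    public static int countHits(String queryText) throws IOException, ParseException {
        Analyzer analyzer = new SimpleAnalyzer();
        RAMDirectory directory = new RAMDirectory();
        try (IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            // Tokenized field: indexed as the terms "bill" and "evans".
            doc.add(new TextField("content", "Bill Evans", Field.Store.NO));
            // Untokenized field: keeps the original value as one stored term.
            doc.add(new StringField("storedcontent", "Bill Evans", Field.Store.YES));
            writer.addDocument(doc);
        }
        try (IndexReader reader = DirectoryReader.open(directory)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Same analyzer at query time, so "Bill E" becomes content:bill content:e.
            Query query = new QueryParser("content", analyzer).parse(queryText);
            return searcher.search(query, 1).scoreDocs.length;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countHits("Bill E")); // prints 1
        System.out.println(countHits("Miles"));  // prints 0
    }
}
```

Note that "Bill E" would match any document containing bill, not only ones where a later word starts with "e". If that matters, the tokenized field would need a different query type for the trailing fragment.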

All in all, the system is working the way you have coded it :)

So first work out your requirements: what you wish to index, and what kinds of queries you wish to run against that index.

Upvotes: 2
