Why is Lucene sometimes not matching InChIKeys?

Question

I have indexed my database using Hibernate Search. I use a custom analyzer, both for indexing and for querying. I have a field called inchikey that should not get tokenized. Example values are:

BBBAWACESCACAP-UHFFFAOYSA-N
KEZLDSPIRVZOKZ-AUWJEWJLSA-N

When I look into my index with Luke I can confirm that they are not tokenized, as required.

However, when I try to search them using the web app, some inchikeys are found and others are not. Curiously, for these inchikeys the search DOES work when I search without the last hyphen, as so: BBBAWACESCACAP-UHFFFAOYSA N

I have not been able to find a common element in the inchikeys that are not found.

Any idea what is going on here?

I use a MultiFieldQueryParser to search over the different fields in the database:

    String[] searchfields = Compound.getSearchfields();
    MultiFieldQueryParser parser = new MultiFieldQueryParser(Version.LUCENE_29, Compound.getSearchfields(), new ChemicalNameAnalyzer());
    //Disable the following if search performance is too slow
    parser.setAllowLeadingWildcard(true);
    FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(parser.parse("searchterms"), Compound.class);
    List hits = fullTextQuery.list();

More details about our setup have been posted here by Tim and I.

Simmer · Accepted Answer

It turns out the last entries in the input file are not being indexed correctly. These ARE being tokenized. In fact, it seems they are indexed twice: once without being tokenized and once with. When I search I cannot find the un-tokenized.

I have not yet found the reason, but I think it perhaps has to do with our parser ending while Lucene is still indexing the last entries, and as a result Lucene reverting to the default analyzer (StandardAnalyzer). When I find the culprit I will report back here.

Adding @Analyzer(impl = ChemicalNameAnalyzer.class) to the fields solves the problem, but what I want is my original setup, with the default analyzer defined once, in config, like so:

path.to.ChemicalNameAnalyzer

Why is Lucene sometimes not matching InChIKeys?

Answers (1)

Related Questions