Marsellus Wallace

Reputation: 18601

Java Lucene Filters seem not to alter the query search space as expected (by me)

THIS QUESTION HAS BEEN EDITED AFTER NEW FINDINGS

I am using DefaultSimilarity (TF-IDF) to search a sample index with 4 documents. When using a filtered query, I noticed that while it correctly reduces the number of results, it does not alter the document scores. This made me very suspicious...

So I extended DefaultSimilarity to print the inputs to the tf*idf computation (term frequency, total number of documents, and document frequency), and I confirmed that these values do not change at all. I was expecting numDocs and docFreq to reflect the smaller search space introduced by the filter. (Read here if you have time)

This is my document collection (text is a TextField):

id=0 type=type.colors title=This is a black dog
id=1 type=type.pets title=This is a black cat
id=2 type=type.colors title=The cat is white
id=3 type=type.pets title=The cat is black

When I search for "black":

Query query = parser.parse("black");
TopDocs results = searcher.search(query, 5);

I get numDocs=4 and docFreq=3 as expected.

I then tried to reduce the search space in the following ways:

1)

PrefixFilter prefixFilter = new PrefixFilter(new Term("type", "type.colors"));
TopDocs results = searcher.search(query, prefixFilter, 5);

2)

PrefixQuery categoryQuery = new PrefixQuery(new Term("type", "type.colors"));
QueryWrapperFilter categoryFilter = new QueryWrapperFilter(categoryQuery);
TopDocs results = searcher.search(query, categoryFilter, 5);

3)

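// Require both the type restriction and the "black" query.
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new PrefixQuery(new Term("type", "type.colors")), Occur.MUST);
booleanQuery.add(query, Occur.MUST); // query is the parsed "black" query from above
TopDocs results = searcher.search(booleanQuery, 5);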

In all three cases I end up with the same values of numDocs and docFreq, instead of numDocs=2 and docFreq=1 (the search space should have been reduced to 2 documents, only 1 of which contains "black").
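
For reference, the values I expected can be reproduced by hand. The following is a minimal plain-Java sketch, not Lucene code: the Doc class, the method names, and the lowercase whitespace tokenization are assumptions standing in for the real index and analyzer.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (no Lucene): recomputes numDocs and docFreq by hand.
final class FilteredStatsSketch {

    // Hypothetical stand-in for one indexed document.
    static final class Doc {
        final String type;
        final String title;
        Doc(String type, String title) { this.type = type; this.title = title; }
    }

    // The four sample documents from the question, in id order 0..3.
    static final List<Doc> DOCS = Arrays.asList(
        new Doc("type.colors", "This is a black dog"),
        new Doc("type.pets", "This is a black cat"),
        new Doc("type.colors", "The cat is white"),
        new Doc("type.pets", "The cat is black"));

    // Number of documents whose type starts with the prefix ("" keeps all).
    static long numDocs(String typePrefix) {
        return DOCS.stream().filter(d -> d.type.startsWith(typePrefix)).count();
    }

    // Number of those documents whose title contains the term as a whole word.
    static long docFreq(String typePrefix, String term) {
        return DOCS.stream()
            .filter(d -> d.type.startsWith(typePrefix))
            .filter(d -> Arrays.asList(d.title.toLowerCase().split("\\s+")).contains(term))
            .count();
    }

    public static void main(String[] args) {
        System.out.println("unfiltered: numDocs=" + numDocs("")
            + " docFreq=" + docFreq("", "black"));
        System.out.println("filtered:   numDocs=" + numDocs("type.colors")
            + " docFreq=" + docFreq("type.colors", "black"));
    }
}
```

Passing an empty prefix counts over the whole collection, while "type.colors" counts only over the subset the filter would keep.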

It seems that either those values are pre-calculated at index creation or that the filter is applied after the query returns. I'm not happy with either of the alternatives...

How can I have Lucene calculate those values after the filter has been applied?

Full gist here

Upvotes: 2

Views: 252

Answers (1)

femtoRgon

Reputation: 33351

Scores aren't really comparable between different queries. The fact that you get the same score between two different queries isn't really a meaningful result. You are getting the correct results in the correct order. The fact that they happen to be equal just gets into implementation details. Scores are only comparable between documents returned as part of the same query.

You can call IndexSearcher.explain to get a better understanding of why documents get the scores they do.
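
To make the "implementation details" point a bit more concrete, here is a rough sketch of the idf arithmetic. It assumes the formula that, as far as I know, DefaultSimilarity uses: idf = 1 + ln(numDocs / (docFreq + 1)). The class and method names are made up for illustration, and this is not a substitute for reading the explain output.

```java
// Plain-Java sketch of the idf factor, assuming the formula
// idf = 1 + ln(numDocs / (docFreq + 1)). Not taken from Lucene source.
final class IdfSketch {

    // Inverse document frequency for a term appearing in docFreq of numDocs documents.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log(numDocs / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // Index-wide statistics reported in the question.
        System.out.println("idf(docFreq=3, numDocs=4) = " + idf(3, 4));
        // Statistics the asker expected after filtering.
        System.out.println("idf(docFreq=1, numDocs=2) = " + idf(1, 2));
        // A rarer term in the same index, for contrast.
        System.out.println("idf(docFreq=1, numDocs=4) = " + idf(1, 4));
    }
}
```

Under that assumption, both pairs of statistics from the question give the same idf, since 4 / (3 + 1) and 2 / (1 + 1) both equal 1. On this tiny sample the idf factor would therefore not change even if the statistics were recomputed over the filtered subset, which is one more reason not to read much into equal scores across different queries.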

Upvotes: 1
