Reputation: 18601
THIS QUESTION HAS BEEN EDITED AFTER NEW FINDINGS
I am using DefaultSimilarity (TF-IDF) to search a sample index with 4 documents. When using a filtered query, I noticed that while it correctly reduces the number of results, it does not alter the document scores. This made me very suspicious...
So I extended DefaultSimilarity to print out the tf*idf values of term_frequency, total_number_of_documents, and document_frequency, and I confirmed that these values do not change at all. I was expecting numDocs and docFreq to reflect the smaller search space introduced by the filter. (Read here if you have time)
This is my document collection (title is a TextField):
id=0 type=type.colors title=This is a black dog
id=1 type=type.pets title=This is a black cat
id=2 type=type.colors title=The cat is white
id=3 type=type.pets title=The cat is black
When I search for "black":
Query query = parser.parse("black");
TopDocs results = searcher.search(query, 5);
I get numDocs=4 and docFreq=3, as expected.
I then tried to reduce the search space in the following ways:
1)
PrefixFilter prefixFilter = new PrefixFilter(new Term("type", "type.colors"));
TopDocs results = searcher.search(query, prefixFilter, 5);
2)
PrefixQuery categoryQuery = new PrefixQuery(new Term("type", "type.colors"));
QueryWrapperFilter categoryFilter = new QueryWrapperFilter(categoryQuery);
TopDocs results = searcher.search(query, categoryFilter, 5);
3)
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new PrefixQuery(new Term("type", "type.colors")), Occur.MUST);
booleanQuery.add(query, Occur.MUST); // "query" is the parsed "black" query from above
TopDocs results = searcher.search(booleanQuery, 5);
In all three cases I end up with the same values of numDocs and docFreq, instead of numDocs=2 and docFreq=1 (the search space should have been reduced to 2 documents, only 1 of which contains "black").
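To make the expected numbers concrete, here is a minimal plain-Java sketch (no Lucene involved) that recomputes numDocs and docFreq for the four sample documents, with and without the "type.colors" restriction. The class FilteredStatsSketch, the Doc holder, and the helper methods are hypothetical names used only for illustration. The idf method assumes the classic DefaultSimilarity formula, idf = 1 + ln(numDocs / (docFreq + 1)); if your Lucene version uses a different formula, adjust accordingly.

```java
import java.util.Arrays;
import java.util.List;

public class FilteredStatsSketch {

    /** Hypothetical stand-in for an indexed document (not a Lucene class). */
    static final class Doc {
        final String type;
        final String title;

        Doc(String type, String title) {
            this.type = type;
            this.title = title;
        }
    }

    /** The four sample documents from the question. */
    static final List<Doc> DOCS = Arrays.asList(
            new Doc("type.colors", "This is a black dog"),
            new Doc("type.pets", "This is a black cat"),
            new Doc("type.colors", "The cat is white"),
            new Doc("type.pets", "The cat is black"));

    /** Number of documents whose type starts with the prefix ("" means no filter). */
    static int numDocs(String typePrefix) {
        int count = 0;
        for (Doc d : DOCS) {
            if (d.type.startsWith(typePrefix)) {
                count++;
            }
        }
        return count;
    }

    /** Number of documents matching the prefix whose title contains the term. */
    static int docFreq(String term, String typePrefix) {
        int count = 0;
        for (Doc d : DOCS) {
            if (!d.type.startsWith(typePrefix)) {
                continue;
            }
            // Naive lowercase/whitespace tokenization, standing in for the analyzer.
            List<String> tokens = Arrays.asList(d.title.toLowerCase().split("\\s+"));
            if (tokens.contains(term)) {
                count++;
            }
        }
        return count;
    }

    /** Assumed classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1)). */
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        for (String prefix : new String[] {"", "type.colors"}) {
            int n = numDocs(prefix);
            int df = docFreq("black", prefix);
            System.out.println("prefix='" + prefix + "' numDocs=" + n
                    + " docFreq=" + df + " idf=" + idf(df, n));
        }
    }
}
```

Under that assumed formula, both the unfiltered statistics (numDocs=4, docFreq=3) and the filtered ones (numDocs=2, docFreq=1) give ln(1) + 1, so for this particular sample collection the idf component alone would not distinguish the two cases even if the statistics were recomputed.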
It seems that either those values are pre-calculated at index creation or that the filter is applied after the query returns. I'm not happy with either of the alternatives...
How can I have Lucene calculate those values after the filter has been applied?
Full gist here
Upvotes: 2
Views: 252
Reputation: 33351
Scores aren't really comparable between different queries. The fact that you get the same score between two different queries isn't really a meaningful result. You are getting the correct results in the correct order. The fact that they happen to be equal just gets into implementation details. Scores are only comparable between documents returned as part of the same query.
You can call IndexSearcher.explain to get a better understanding of why documents get the scores they do.
Upvotes: 1