Reputation: 377
I have a somewhat unusual problem (at least I think it is one ^^). I hope I can describe what I want to do:
I have a set of terms (Strings), and each term also has a score (double). I now want to match these terms against the documents in my Lucene index.
But I want to consider all possible combinations of these terms. At first my idea was to simply build one giant
`BooleanQuery: field1:term1 OR field1:term2 .... OR field2:term1 OR field2:term2 ...`
But this query would of course not return the same results as generating a separate query for each term:
`Query1: field1:term1 OR field2:term1 ...`
`Query2: field1:term2 OR field2:term2 ...`
The problem is that my application is an IR application: the terms are generated / extracted automatically, and I don't know which of them should be searched for together and which are better searched alone. So I want to have the "best of both worlds".
Is there a way to have a query that searches for all possible combinations of my list of terms?
Of course I could write some loops and generate a query for every possible combination, but that will probably run forever...
I hope you understand what I want and can help me :) Thanks!
Upvotes: 0
Views: 341
Reputation: 33341
Not quite sure what the final result set you want is, but here are a couple of possibilities:
If you simply want every match in any searched field against any of the terms, then:
`field1:term1 OR field1:term2 .... OR field2:term1 OR field2:term2 ...`
Or
`field1:term1 field1:term2 .... field2:term1 field2:term2 ...`
is perfectly adequate.
If you only want results that have at least one match for every term, but in any searched field, then you could structure the query like this:
`+(field1:term1 field2:term1) +(field1:term2 field2:term2) ...`
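A minimal sketch of how that required-group query string could be generated from a list of fields and a list of terms. The class and method names (`RequiredTermQuery`, `build`) are made up for illustration, and the terms are assumed to be already analyzed and free of query-syntax special characters; real input would likely need escaping first (for example with `QueryParser.escape`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: builds "+(f1:t1 f2:t1) +(f1:t2 f2:t2) ..." as a query string.
final class RequiredTermQuery {
    static String build(List<String> fields, List<String> terms) {
        List<String> groups = new ArrayList<>();
        for (String term : terms) {
            // One optional clause per field for this term.
            List<String> clauses = new ArrayList<>();
            for (String field : fields) {
                clauses.add(field + ":" + term);
            }
            // The leading '+' makes the whole group required.
            groups.add("+(" + String.join(" ", clauses) + ")");
        }
        return String.join(" ", groups);
    }
}
```

The resulting string can then be handed to a query parser. Dropping the `+` prefix would give the looser "any term in any field" form instead.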
Alternatively, you could merge the fields you want to search here into a single searchable field, making them much easier to search together. Whether that is the better way to accomplish this depends on your application though.
As for tuning your query to prevent one term from dominating your search results:
I think the first step on tuning your query would be to find out why certain terms are dominating your results. Key to that would be learning to use: IndexSearcher.explain(query,doc). This will explain how the document was scored. Luke provides a nice interface to try queries against an index, and see why documents get the scores they do.
Also, TFIDFSimilarity documents the main pieces of the DefaultSimilarity class that calculates the scores by default. The documentation there will help to understand certain aspects of the scoring parameters displayed in Luke/explain(query,doc).
My best guess at the problem is that you may be matching the same common term in multiple fields. This compounds the score for that term in each field it is found in, and can crowd out results for terms that appear in only one field (but may be equally relevant in your case). If so, you can fix it by wrapping the queries that search multiple fields for the same term in a DisjunctionMaxQuery.
For example:
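import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.TermQuery;

// Uses the older mutable query API. One DisjunctionMaxQuery per term, covering every field.
// 1.1f is the tie-breaker from the original answer; Lucene's docs suggest a small value, on the order of 0.1.
BooleanQuery root = new BooleanQuery();
DisjunctionMaxQuery dismax1 = new DisjunctionMaxQuery(1.1f);
dismax1.add(new TermQuery(new Term("field1", "term1")));
dismax1.add(new TermQuery(new Term("field2", "term1")));
//etc
root.add(dismax1, BooleanClause.Occur.SHOULD);
DisjunctionMaxQuery dismax2 = new DisjunctionMaxQuery(1.1f);
dismax2.add(new TermQuery(new Term("field1", "term2")));
dismax2.add(new TermQuery(new Term("field2", "term2")));
//etc
root.add(dismax2, BooleanClause.Occur.SHOULD);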
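To see why this can help, here is a small plain-Java sketch (no Lucene dependency) of the arithmetic involved. It assumes the combination rule described in the DisjunctionMaxQuery documentation, where the score is the best sub-query score plus the tie-breaker times the scores of the other matching sub-queries, and compares it with simply adding up the per-field scores, which is roughly what plain SHOULD clauses do. The class name, the per-field scores and the 0.1 tie-breaker are invented for illustration, scores are assumed to be non-negative, and normalization factors are ignored.

```java
// Hypothetical illustration of sum-style vs. dismax-style score combination.
final class DisMaxArithmetic {
    // Sum-style combination: every matching field adds its full score.
    static double summed(double... fieldScores) {
        double sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
        }
        return sum;
    }

    // Dismax-style combination: best score plus tieBreaker times the remaining scores.
    static double disMax(double tieBreaker, double... fieldScores) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }
}
```

With these made-up numbers, a common term scoring 2.0 and 1.5 in two fields gets 3.5 when summed but about 2.15 under dismax, while a term that matches a single field with score 2.0 gets 2.0 either way. The gap between the two terms shrinks considerably.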
Upvotes: 1
Reputation: 26703
Not sure if this will be helpful, but you could take the information from all the fields and duplicate it in one additional field.
I know it's redundant, but if disk space is not a problem, it could make queries more convenient to run, so your query becomes
`aggr_field:(term1 OR term2 OR term3)`
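As a rough sketch, the query string for such a combined field could be assembled like this. The class and method names are hypothetical, `aggr_field` is just the example field name from above, and the terms are assumed to need no escaping.

```java
import java.util.List;

// Hypothetical helper: builds "field:(t1 OR t2 OR ...)" for a single combined field.
final class AggregateFieldQuery {
    static String build(String field, List<String> terms) {
        return field + ":(" + String.join(" OR ", terms) + ")";
    }
}
```

Calling `AggregateFieldQuery.build("aggr_field", terms)` with the three example terms yields the query shown above.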
Upvotes: 0