Reputation: 11
The texts I query for (and the queries itself) have on average 11 words (up to about 25). I want my query to return matches only if at least half of the words in query are matched in text.
For example, this is how my initial Lucene query looks like (for simplicity it has only 4 words):
jakarta~ apache~ lucene~ stackoverflow~
It will return a match if at least one of the words is fuzzy matched but I want it to return a match only if at least any two (half of 4) of the words are fuzzy matched.
Is it possible in Lucene?
I could split my query like this (OR
is default operator in Lucene):
(jakarta~ apache~) AND (lucene~ stackoverflow~)
But that wouldn’t return a match is both jakarta
and apache
are matched but none of lucene
and stackoverflow
is matched.
I could change my query to:
(jakarta~ AND apache~) (jakarta~ AND lucene~) (jakarta~ AND stackoverflow~)
(apache~ AND lucene~) (apache~ and stackoverflow~) (lucene~ AND stackoverflow~)
Would that be efficient? On average my expression would consist of 462 AND
clauses (binomial coefficient of 11 and 6), in the worst case of 5200300 AND
clauses (binomial coefficient of 25 and 13).
If it is not possible (or doesn’t make sense performance wise) to do in Lucene, is it possible in Elasticsearch or Solr?
It should work fast (<= 0.5 sec/search) for at least 10 000 texts in database.
It would be even better if I could easily later change the minimum matches percentage (e.g. 40% instead of 50%) but I may not need this.
Upvotes: 1
Views: 1937
Reputation: 33341
All three options support a minimum should match functionality among optional query clauses.
Lucene: Set in BooleanQueries via the BooleanQuery.Builder.setMinimumShouldMatch
method.
Solr: The DisMax mm
parameter.
Elasticsearch: The minimum_should_match
parameter, in Bool queries, Multi Match queries, etc.
Upvotes: 2
Reputation: 9789
In Solr, you can use minimum match (mm) parameter with DisMax and eDisMax and you can specify the percentage of the match expected.
Upvotes: 0