Rusty

Reputation: 11

Minimum number of word matches in Lucene / Elasticsearch / Solr

The texts I search (and the queries themselves) have 11 words on average (up to about 25). I want my query to return a match only if at least half of the words in the query are matched in the text.

For example, this is what my initial Lucene query looks like (for simplicity it has only 4 words):

jakarta~ apache~ lucene~ stackoverflow~

It returns a match if at least one of the words is fuzzy matched, but I want it to return a match only if at least two of the words (half of 4) are fuzzy matched.

Is it possible in Lucene?

I could split my query like this (OR is default operator in Lucene):

(jakarta~ apache~) AND (lucene~ stackoverflow~)

But that wouldn’t return a match if both jakarta and apache are matched but neither lucene nor stackoverflow is matched.

I could change my query to:

(jakarta~ AND apache~) (jakarta~ AND lucene~) (jakarta~ AND stackoverflow~)
(apache~ AND lucene~) (apache~ AND stackoverflow~) (lucene~ AND stackoverflow~)

Would that be efficient? On average my expression would consist of 462 AND clauses (binomial coefficient of 11 and 6), in the worst case of 5200300 AND clauses (binomial coefficient of 25 and 13).

If it is not possible (or doesn’t make sense performance-wise) to do this in Lucene, is it possible in Elasticsearch or Solr?

It should work fast (<= 0.5 sec/search) for at least 10 000 texts in the database.

It would be even better if I could later easily change the minimum match percentage (e.g. 40% instead of 50%), but I may not need this.

Upvotes: 1

Views: 1937

Answers (2)

femtoRgon

Reputation: 33341

All three options support "minimum should match" functionality, which sets how many of the optional (OR'ed) query clauses must match.
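As a rough illustration for Elasticsearch, the sketch below builds a `bool` query body in which every word is an optional fuzzy clause and `minimum_should_match` carries the required count. The field name `text`, the helper name `build_min_match_query`, and the `min_percent` parameter are assumptions made for this example, not something from the question. The count is computed explicitly and rounded up, because a percentage value passed to `minimum_should_match` is rounded down (50% of 11 words would require 5, not 6). In Lucene itself the same idea is available through `BooleanQuery.Builder.setMinimumNumberShouldMatch`.

```python
import json


def build_min_match_query(words, field="text", min_percent=50):
    """Build an Elasticsearch request body (hypothetical helper).

    Each word becomes an optional fuzzy clause; at least `min_percent`
    percent of the clauses (rounded up) must match.
    """
    # `fuzzy` is a term-level query, so the words are assumed to be
    # already lowercased/tokenized the same way as the indexed field.
    should = [
        {"fuzzy": {field: {"value": word, "fuzziness": "AUTO"}}}
        for word in words
    ]
    # Integer ceiling division avoids floating-point surprises.
    required = (len(words) * min_percent + 99) // 100
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": required,
            }
        }
    }


body = build_min_match_query(["jakarta", "apache", "lucene", "stackoverflow"])
print(json.dumps(body, indent=2))
```

Changing the threshold later only means passing a different `min_percent`, e.g. `build_min_match_query(words, min_percent=40)`.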

Upvotes: 2

Alexandre Rafalovitch

Reputation: 9789

In Solr, you can use the minimum match (mm) parameter with the DisMax and eDisMax query parsers, and you can specify it as a percentage of the query clauses that must match.
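A minimal sketch of what the request parameters might look like, assuming the eDisMax parser and a searchable field named `text` (the field name and the helper `build_solr_params` are hypothetical). It only assembles the query string; sending it to a Solr core's `/select` handler is left out.

```python
from urllib.parse import urlencode


def build_solr_params(words, field="text", mm="50%"):
    """Assemble eDisMax parameters with a minimum-match threshold."""
    return {
        "defType": "edismax",
        # Each word gets a trailing ~ to request a fuzzy match.
        "q": " ".join(f"{word}~" for word in words),
        "qf": field,
        # Percentage (or absolute number) of optional clauses that must match.
        "mm": mm,
    }


params = build_solr_params(["jakarta", "apache", "lucene", "stackoverflow"])
print(urlencode(params))
```

Note that a percentage in `mm` is rounded down, so with 11 words `50%` would require 5 matches; an absolute value such as `mm="6"` can be passed instead when a strict "at least half" is needed.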

Upvotes: 0
