sriram

Reputation: 732

Solr Minimum match customization

I have a case wherein I would like to match like this:

Query: abcd efgh ijkl mnop

The query is then passed through an NGram tokenizer, which splits each word into 2-gram tokens.

E.g., the query is split into:

ab,bc,cd,ef,fg,gh,ij,jk,kl,mn,no,op
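To make the tokenization concrete, here is a minimal sketch in plain Python that reproduces the split above. It is not Solr's NGram tokenizer; the function names `bigrams` and `tokenize` are made up for illustration, and it assumes words are separated by whitespace.

```python
def bigrams(word):
    """Return the overlapping 2-character tokens of a single word, in order."""
    return [word[i:i + 2] for i in range(len(word) - 1)]


def tokenize(query):
    """Split the query on whitespace, then emit each word's bigrams.

    Tokens never span a word boundary, which mirrors the per-word split
    described in the question (an assumption about the analyzer setup).
    """
    return [gram for word in query.split() for gram in bigrams(word)]


if __name__ == "__main__":
    print(",".join(tokenize("abcd efgh ijkl mnop")))
    # prints: ab,bc,cd,ef,fg,gh,ij,jk,kl,mn,no,op
```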

Now, while matching, I want to customize the minimum match (mm) so that it applies to the tokens within each word.

By default, with mm=1, an indexed document is returned as soon as any one token from any word matches it. With mm=2, one token from each of any 2 words must match for the document to be returned.

What I want instead is to return a document only when at least 'm' tokens match in each of mm words, where mm is the required number of words.

For example, I want at least 2 tokens from each of at least 3 words to match before the indexed document is selected.
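The rule I have in mind can be sketched in plain Python as below. This is only an illustration of the desired behaviour, not Solr or Lucene code: `matches`, `min_tokens` and `min_words` are hypothetical names, and the document's bigrams are pooled into one set on the assumption that the index does not keep track of which word a token came from.

```python
def bigrams(word):
    """Return the set of overlapping 2-character tokens of a word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def matches(query, document, min_tokens=2, min_words=3):
    """Return True if at least `min_words` query words each have at least
    `min_tokens` of their bigrams present in the document.

    Hypothetical helper for illustration only; it does not call Solr.
    """
    # Pool all bigrams of the document, ignoring word boundaries.
    doc_grams = set()
    for word in document.split():
        doc_grams |= bigrams(word)

    # Count the query words that reach the per-word token threshold.
    qualifying = sum(
        1 for word in query.split()
        if len(bigrams(word) & doc_grams) >= min_tokens
    )
    return qualifying >= min_words


if __name__ == "__main__":
    query = "abcd efgh ijkl mnop"
    print(matches(query, "abcx efgx ijkx zzzz"))  # three words reach 2 tokens
    print(matches(query, "abxx efgx ijkx zzzz"))  # only two words reach 2 tokens
```

With the defaults, the first document is selected and the second is not, because "abcd" contributes only one matching token ("ab") in the second case.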

It seems that Lucene's IndexSearcher handles this core part. Do I need to change the code, or is there a config option that would do this?

Thanks in advance...

Upvotes: 0

Views: 866

Answers (1)

Xodarap

Reputation: 11849

This isn't exactly what you're asking for, but I'm guessing your underlying question is "how can I ensure that fuzzy searches only return things which are 'close' to the original query?"

The syntax foo~0.8 does this - see the docs. Basically, 0.8 is a minimum similarity derived from the edit (Levenshtein) distance relative to the length of the word, so values closer to 1 only accept closer matches.

If you want to stick to your idea of counting pairs which must match, you can do some math to figure out what the Levenshtein distance threshold needs to be.
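One way to do that math, as a rough sketch: a word of length n has n - 1 bigrams, and a single edit (insert, delete or substitute) can destroy at most 2 of them, so a word within d edits still shares at least (n - 1) - 2d bigrams with the original. Requiring m matching bigrams therefore allows at most floor((n - 1 - m) / 2) edits. The helper `max_edits` below is a made-up name, and the bound is conservative: it is a guarantee, so some words with more edits may still share m bigrams.

```python
def max_edits(word_length, min_bigrams):
    """Largest edit distance that still guarantees `min_bigrams` shared bigrams.

    Based on the bound: shared bigrams >= (word_length - 1) - 2 * edits.
    Hypothetical helper for illustration; not part of Solr or Lucene.
    """
    total = word_length - 1  # number of bigrams in the query word
    if min_bigrams > total:
        raise ValueError("word has fewer bigrams than the required minimum")
    return (total - min_bigrams) // 2


if __name__ == "__main__":
    print(max_edits(4, 2))  # 4-letter word, 2 of its 3 bigrams required
    print(max_edits(8, 3))  # 8-letter word, 3 of its 7 bigrams required
```

For the 4-letter words in the question, requiring 2 of 3 bigrams leaves no guaranteed edit budget, so a fuzzy threshold only becomes a useful substitute for longer words.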

Upvotes: 1
