user1189332

Reputation: 1941

Lucene custom similarity/scoring

I'm looking for a similarity module in Lucene (Java) that gives a weight-based score. I know this is vague, so let me explain with an example.

Document 1
-----------
Firstname: Francesca

Document 2
-----------
Firstname: Francisco

The Firstname field is analysed using the Double Metaphone and Refined Soundex phonetic algorithms at indexing time.

Therefore, the inverted index looks like this (for each document, the last two terms come from Double Metaphone and Refined Soundex respectively):

francesca ===> Doc1
francisco ===> Doc2
FRNS   ===> Doc1, Doc2
F29083030 ===> Doc1
F2908306 ===> Doc2

Now my search query looks like this: Firstname: "francesca"

Obviously, for Doc1, all 4 terms match. For each match, I want to give 25% (I know in advance that there can be at most 4 expanded terms for a given term).

Going by this principle, I want to give the following score:

Doc1 (100)  [Reason: All 4 terms match]
Doc2 (25)  [Reason: Only FRNS term matches, rest don't match]
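To make the intended arithmetic concrete, here is a minimal plain-Java sketch of the scoring rule described above. It does not use any Lucene API; it only counts how many expanded query terms occur in a document's term set. The class name `TermMatchScore` and the fourth term `"ALT"` are hypothetical, since the index listing above shows only three terms per document.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class TermMatchScore {

    /** Percentage of the expanded query terms that occur in the document's terms. */
    static double score(List<String> expandedQueryTerms, Set<String> docTerms) {
        if (expandedQueryTerms.isEmpty()) {
            return 0.0;
        }
        int matched = 0;
        for (String term : expandedQueryTerms) {
            if (docTerms.contains(term)) {
                matched++;
            }
        }
        // Each matching term contributes an equal share of 100.
        return 100.0 * matched / expandedQueryTerms.size();
    }

    public static void main(String[] args) {
        // "ALT" is a hypothetical fourth expanded term, added only for illustration.
        List<String> query = Arrays.asList("francesca", "FRNS", "F29083030", "ALT");
        Set<String> doc1 = new HashSet<>(Arrays.asList("francesca", "FRNS", "F29083030", "ALT"));
        Set<String> doc2 = new HashSet<>(Arrays.asList("francisco", "FRNS", "F2908306"));

        System.out.println("Doc1: " + score(query, doc1)); // prints "Doc1: 100.0"
        System.out.println("Doc2: " + score(query, doc2)); // prints "Doc2: 25.0"
    }
}
```

With four expanded terms, each match is worth 25, which gives the 100 and 25 shown for Doc1 and Doc2.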

Now my question is: is there any similarity module available off the shelf that achieves this? If not, I believe I should extend DefaultSimilarity and override the necessary methods. But where is the code that calls the similarity module and sums up all the scores per document? The reason I ask is that I will extend this weight-based scoring to other fields too, in which case the total score per document will be the average of the weighted scores of the individual fields. Therefore, I would also need to customise the code that sums up the scores of the individual fields and override it to compute the average instead. Can someone give me some pointers, please? Thanks.
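As an illustration of the multi-field case I have in mind, here is a small plain-Java sketch that averages per-field percentage scores into one document score. It is not Lucene code; the class name `FieldAverageScore` and the `Lastname` field with its score of 50 are hypothetical values chosen only to show the calculation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class FieldAverageScore {

    /** Mean of the per-field percentage scores; 0 when no field was scored. */
    static double average(Map<String, Double> fieldScores) {
        if (fieldScores.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (double fieldScore : fieldScores.values()) {
            sum += fieldScore;
        }
        return sum / fieldScores.size();
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("Firstname", 100.0); // all expanded terms matched
        scores.put("Lastname", 50.0);   // hypothetical second field, half the terms matched
        System.out.println("Document score: " + average(scores)); // prints "Document score: 75.0"
    }
}
```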

Upvotes: 1

Views: 1200

Answers (1)

Vineeth Mohan

Reputation: 19283

A good place to start would be Jörg Prante's project: https://github.com/jprante/elasticsearch-payload

Along with other projects, he has also extended the similarity module.

As for the implementation, I would advise you to look into the type field or the payload field of the token to deduce the score.

In the following file: https://github.com/jprante/elasticsearch-payload/blob/master/src/main/java/org/xbib/elasticsearch/plugin/payload/PayloadPlugin.java

you can see the following code sample, which shows how to add a similarity module:

// Registers the custom similarity under the name "payload_similarity".
public void onModule(SimilarityModule module) {
    module.addSimilarity("payload_similarity", PayloadSimilarityProvider.class);
}

Upvotes: 1
