Sandy

Reputation: 53

Lucene 6 - How to influence ranking with numeric value?

I am new to Lucene, so apologies for any unclear wording. I am working on an author search engine. The search query is the author name. The default search results are good - they return the names that match the most. However, we want to rank the results by author popularity as well, a blend of both the default similarity and a numeric value representing the circulations their titles have. The problem with the default results is it returns authors nobody is interested in, and while I can rank by circulation alone, the top result is generally not a great match in terms of name. I have been looking for days for a solution for this.

This is how I am building my index:

    IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get(INDEX_LOCATION)),
        new IndexWriterConfig(new StandardAnalyzer()));
    writer.deleteAll();
    for (Contributor contributor : contributors) {
        Document doc = new Document();
        doc.add(new TextField("name", contributor.getName(), Field.Store.YES));
        doc.add(new StoredField("contribId", contributor.getContribId()));
        doc.add(new NumericDocValuesField("sum", sum));
        writer.addDocument(doc);
    }
    writer.close();

The name is the field we want to search on, and the sum is the field we want to weight our search results with (but still taking into account the best match for the author name). I'm not sure if adding the sum to the document is the correct thing to do in this situation. I know that there will need to be some experimentation to figure out how to best blend the weighting of the two factors, but my problem is I don't know how to do it in the first place.

Any examples I've been able to find are either pre-Lucene 4 or don't seem to work. I thought this was what I was looking for, but it doesn't seem to work. Help appreciated!

Upvotes: 4

Views: 1349

Answers (1)

Philipp Ludwig

Reputation: 4185

As demonstrated in the blog post you linked, you could use a CustomScoreQuery; this would give you a lot of flexibility and influence over the scoring process, but it is also a bit overkill. Another possibility is to use a FunctionScoreQuery; since they behave differently, I will explain both.

Using a FunctionScoreQuery

A FunctionScoreQuery can modify a score based on a field.

Let's say you usually perform a search like this:

Query q = .... // pass the user input to the QueryParser or similar
TopDocs hits = searcher.search(q, 10); // Get 10 results

Then you can modify the query in between like this:

Query q = .....

// Note that a Float field would work better.
DoubleValuesSource boostByField = DoubleValuesSource.fromLongField("sum");

// Create a query, based on the old query and the boost
FunctionScoreQuery modifiedQuery = new FunctionScoreQuery(q, boostByField);

// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);

This will modify the score based on the value of the field. Unfortunately, as far as I know, there is no way to control the influence of the DoubleValuesSource other than by scaling the values during indexing.
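Since scaling during indexing is the only lever mentioned here, this is a minimal sketch of what such scaling could look like. The class PopularityScaling and its method dampen are hypothetical names, not part of Lucene, and the choice of log10 is only an assumption about what might suit circulation counts; you would call it on sum before adding the field to the document.

```java
// Hypothetical helper, not part of Lucene: dampens a raw circulation
// count before it is written to the index, so that very popular authors
// do not drown out everything else.
final class PopularityScaling {
    private PopularityScaling() {}

    /**
     * Maps a raw count to log10(1 + count). A count of 0 stays 0 and each
     * tenfold increase in circulation adds roughly 1 to the result.
     * Negative inputs are treated as 0.
     */
    static double dampen(long rawSum) {
        long clamped = Math.max(0L, rawSum);
        return Math.log10(1.0 + clamped);
    }
}
```

The dampened value is a double, so it would need a floating-point doc values field (for example DoubleDocValuesField, read back with DoubleValuesSource.fromDoubleField, assuming your Lucene version provides both) instead of the long field used in the question.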

To have more control, consider using the CustomScoreQuery.

Using a CustomScoreQuery

Using this kind of query allows you to modify the score of each result any way you like. In this context we will use it to alter the score based on a field in the index. First, you will have to store your value during indexing:

doc.add(new StoredField("sum", sum)); 

Then we will have to create our very own query class:

private static class MyScoreQuery extends CustomScoreQuery {
    public MyScoreQuery(Query subQuery) {
        super(subQuery);
    }

    // The CustomScoreProvider is what actually alters the score
    private class MyScoreProvider extends CustomScoreProvider {

        private LeafReader reader;
        private Set<String> fieldsToLoad;

        public MyScoreProvider(LeafReaderContext context) {
            super(context);
            reader = context.reader();

            // We create a HashSet which contains the name of the field
            // which we need. This allows us to retrieve the document 
            // with only this field loaded, which is a lot faster.
            fieldsToLoad = new HashSet<>();
            fieldsToLoad.add("sum");
        }

        @Override
        public float customScore(int doc_id, float currentScore, float valSrcScore) throws IOException {
            // Get the result document from the index
            Document doc = reader.document(doc_id, fieldsToLoad);

            // Get boost value from index               
            IndexableField field = doc.getField("sum");
            Number number = field.numericValue();

            // This is just an example on how to alter the current score
            // based on the value of "sum". You will have to experiment
            // here.
            float influence = 0.01f;
            float boost = number.floatValue() * influence;

            // Return the new score for this result, based on the 
            // original lucene score.
            return currentScore + boost;
        }           
    }

    // Make sure that our CustomScoreProvider is being used.
    @Override
    public CustomScoreProvider getCustomScoreProvider(LeafReaderContext context) {
        return new MyScoreProvider(context);
    }       
}

Now you can use your new Query class to modify an existing query, similar to the FunctionScoreQuery:

Query q = .....

// Create a query, based on the old query and the boost
MyScoreQuery modifiedQuery = new MyScoreQuery(q);

// Search as usual
TopDocs hits = searcher.search(modifiedQuery, 10);
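The line in customScore that computes currentScore + boost is where the experimentation happens. As a rough sketch of two options, the hypothetical class ScoreBlend below (not part of Lucene) puts the additive formula from the example next to a multiplicative variant with log dampening. The multiplicative formula and the influence values are assumptions for illustration, not a recommendation.

```java
// Hypothetical helper, not part of Lucene: two ways to blend the text
// relevance score with the popularity value stored in "sum".
final class ScoreBlend {
    private ScoreBlend() {}

    /**
     * Additive blend, as in the customScore example: the popularity is
     * added on top of the text score, so a very large sum can outrank
     * a much better name match.
     */
    static float additive(float textScore, long sum, float influence) {
        return textScore + sum * influence;
    }

    /**
     * Multiplicative blend with log dampening: the text score is scaled
     * by 1 + influence * log10(1 + sum), so popularity boosts a match
     * but grows slowly for very large counts. Negative sums count as 0.
     */
    static float multiplicative(float textScore, long sum, float influence) {
        double factor = 1.0 + influence * Math.log10(1.0 + Math.max(0L, sum));
        return (float) (textScore * factor);
    }
}
```

Inside customScore you would return one of these instead of currentScore + boost, for example ScoreBlend.multiplicative(currentScore, number.longValue(), 0.5f). With the additive form, a weak name match with a circulation of 100000 easily beats a strong match with no circulation. With the multiplicative form and an influence of 0.5, a five times stronger name match still comes out ahead.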

Final remarks

Using a CustomScoreQuery, you can influence the scoring process in all kinds of ways. Remember, however, that the method customScore is called for every matching document, so don't perform any expensive computations there, as this would severely slow down the search process.

I've created a small gist with a full working example of the CustomScoreQuery here: https://gist.github.com/philippludwig/14e0d9b527a6522511ae79823adef73a

Upvotes: 4
