prestomanifesto

Reputation: 12796

Lucene: give more weight the closer a term is to the beginning of the title

I understand how to boost fields either at index time or query time. However, how could I increase the score of matching a term closer to the beginning of a title?

Example:

Query = "lucene"

Doc1 title = "Lucene: Homepage"
Doc2 title = "I have a question about lucene?"

I would like the first document to score higher since "lucene" is closer to the beginning (ignoring term freq for now).

I see how to use the SpanQuery for specifying the proximity between terms, but I'm not sure how to use the information about the position in the field.

I am using Lucene 4.1 in Java.

Upvotes: 11

Views: 4985

Answers (2)

phanin

Reputation: 5487

From the book "Lucene in Action" (2nd edition):

" Lucene provides a built-in query PayloadTermQuery, in the package org.apache.lucene.search.payloads. This query is just like SpanTermQuery in that it matches all documents containing the specified term and keeps track of the actual occurrences (spans) of the matches.

But then it goes further by enabling you to contribute a scoring factor based on the payloads that appear at each term’s occurrence. To do this, you’ll have to create your own Similarity class that defines the scorePayload method, like this "

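public class BoostingSimilarity extends DefaultSimilarity {
    // Signature as printed in the book (Lucene 3.x API). In Lucene 4.x the
    // method is scorePayload(int doc, int start, int end, BytesRef payload).
    @Override
    public float scorePayload(int docID, String fieldName,
                              int start, int end, byte[] payload,
                              int offset, int length) {
        ....
    }
}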

"start" in the above code is the start position of the matched occurrence (the span). A payload is attached to a term occurrence, so the start position is also the position of the term (at least, that is my understanding).

By using the above code, but disregarding the payload, you have access to the "start" position at scoring time, and you can boost the score based on that value.

For example: new score = original score * ( 1.0f / (1 + start) ). The "1 +" is needed because positions are 0-based, so a term at the very beginning of the field has start = 0.
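A minimal sketch of that boost formula in plain Java, independent of Lucene. The class and method names (PositionBoostSketch, positionBoost, boostedScore) are hypothetical, and it assumes positions are 0-based, which is why it divides by 1 + start rather than by start (that would divide by zero for the first token):

```java
class PositionBoostSketch {

    // Boost factor that decays with the term's start position.
    // Assumption: positions are 0-based, so position 0 gets a factor of 1.0.
    static float positionBoost(int start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return 1.0f / (1 + start);
    }

    // Scales an already computed score by the position-based factor.
    static float boostedScore(float originalScore, int start) {
        return originalScore * positionBoost(start);
    }

    public static void main(String[] args) {
        // "lucene" as the first token versus the sixth token of a title.
        System.out.println(boostedScore(1.0f, 0));
        System.out.println(boostedScore(1.0f, 5));
    }
}
```

Inside a custom Similarity, the return value of positionBoost(start) is what scorePayload would return. A plain reciprocal drops off quickly, so a gentler curve may fit short titles better.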

I hope the above works. Please post here if you find a more efficient solution.

Upvotes: 0

javanna

Reputation: 60205

I would use a SpanFirstQuery, which matches terms near the beginning of a field. Like all span queries, it relies on positions, which are indexed by default in Lucene.

Let's test it on its own: you just have to provide your SpanTermQuery and the end position within which the term must occur (1 in my example, meaning the term must be the first token).

SpanTermQuery spanTermQuery = new SpanTermQuery(new Term("title", "lucene"));
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(spanTermQuery, 1);

Given your two documents, this query will find only the first one, with the title "Lucene: Homepage", provided you analyzed it with the StandardAnalyzer.
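To make the matching rule concrete, here is a rough plain-Java emulation of what a SpanFirstQuery over a single term checks: a term at position p spans [p, p + 1), and it matches when p + 1 <= end. This is only a sketch, not Lucene code. The names (SpanFirstSketch, tokenize, matchesSpanFirst) are hypothetical, and the tokenizer is a simplified stand-in for the StandardAnalyzer (it lowercases and splits on non-alphanumerics, without stop word handling):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SpanFirstSketch {

    // Simplified stand-in for analysis: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // True if the term occurs at a 0-based position p with p + 1 <= end.
    static boolean matchesSpanFirst(List<String> tokens, String term, int end) {
        for (int pos = 0; pos < tokens.size() && pos + 1 <= end; pos++) {
            if (tokens.get(pos).equals(term)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc1 = tokenize("Lucene: Homepage");
        List<String> doc2 = tokenize("I have a question about lucene?");
        System.out.println(matchesSpanFirst(doc1, "lucene", 1)); // true
        System.out.println(matchesSpanFirst(doc2, "lucene", 1)); // false
    }
}
```

Raising end widens the window: with end = 6 the second title matches as well, since "lucene" is its sixth token in this simplified tokenization.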

Now we can combine the above SpanFirstQuery with a normal text query, so that the span query only influences the score. You can do this with a BooleanQuery, adding the span query as a SHOULD clause:

Term term = new Term("title", "lucene");
TermQuery termQuery = new TermQuery(term);
SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
BooleanQuery booleanQuery = new BooleanQuery();
booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));

There are probably other ways to achieve the same result, such as a CustomScoreQuery or custom scoring code, but this seems to me the easiest one.

The code I used to test it prints the following output (scores included), running the TermQuery alone first, then the SpanFirstQuery alone, and finally the combined BooleanQuery:

------ TermQuery --------
Total hits: 2
title: I have a question about lucene - score: 0.26010898
title: Lucene: I have a really hard question about it - score: 0.22295055
------ SpanFirstQuery --------
Total hits: 1
title: Lucene: I have a really hard question about it - score: 0.15764984
------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------
Total hits: 2
title: Lucene: I have a really hard question about it - score: 0.26912516
title: I have a question about lucene - score: 0.09196242
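The combined scores can be reproduced from the standalone ones with a bit of arithmetic. The sketch below reflects my reading of how the default TF-IDF scoring combines clauses, so treat it as an assumption rather than a specification: the clause scores are summed, multiplied by a coord factor (matched clauses / total clauses), and rescaled by 1/sqrt(n) because the query norm shrinks when n clauses of equal weight are combined. The equal-weight assumption holds here since both clauses use the same term with no boost. The class and method names are hypothetical:

```java
class BooleanScoreSketch {

    // standaloneScores: scores of the clauses that matched the document,
    //                   as printed when each query runs on its own.
    // totalClauses:     number of clauses in the BooleanQuery.
    // Assumption: every clause has the same query weight.
    static double combined(double[] standaloneScores, int totalClauses) {
        double sum = 0.0;
        for (double score : standaloneScores) {
            sum += score;
        }
        double coord = (double) standaloneScores.length / totalClauses;
        double normRescale = 1.0 / Math.sqrt(totalClauses);
        return coord * normRescale * sum;
    }

    public static void main(String[] args) {
        // Matches both clauses (TermQuery and SpanFirstQuery).
        System.out.println(combined(new double[] {0.22295055, 0.15764984}, 2));
        // Matches only the TermQuery clause.
        System.out.println(combined(new double[] {0.26010898}, 2));
    }
}
```

The two results come out at roughly 0.2691 and 0.0920, in line with the BooleanQuery output above. The document that misses the SHOULD clause is penalized twice: it gets no span score and its coord factor is halved.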

Here is the complete code:

    // Imports needed by the code below (Lucene 4.1).
    import java.io.File;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.FieldInfo;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.index.IndexableField;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.search.spans.SpanFirstQuery;
    import org.apache.lucene.search.spans.SpanTermQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public static void main(String[] args) throws Exception {

        Directory directory = FSDirectory.open(new File("data"));

        index(directory);

        IndexReader indexReader = DirectoryReader.open(directory);
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);

        Term term = new Term("title", "lucene");

        System.out.println("------ TermQuery --------");
        TermQuery termQuery = new TermQuery(term);
        search(indexSearcher, termQuery);

        System.out.println("------ SpanFirstQuery --------");
        SpanFirstQuery spanFirstQuery = new SpanFirstQuery(new SpanTermQuery(term), 1);
        search(indexSearcher, spanFirstQuery);

        System.out.println("------ BooleanQuery: TermQuery (MUST) + SpanFirstQuery (SHOULD) --------");
        BooleanQuery booleanQuery = new BooleanQuery();
        booleanQuery.add(new BooleanClause(termQuery, BooleanClause.Occur.MUST));
        booleanQuery.add(new BooleanClause(spanFirstQuery, BooleanClause.Occur.SHOULD));
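        search(indexSearcher, booleanQuery);

        // Release the index reader and the directory once all searches are done.
        indexReader.close();
        directory.close();
    }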

    private static void index(Directory directory) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_41, new StandardAnalyzer(Version.LUCENE_41));

        IndexWriter writer = new IndexWriter(directory, config);

        FieldType titleFieldType = new FieldType();
        titleFieldType.setIndexOptions(FieldInfo.IndexOptions.DOCS_AND_FREQS_AND_POSITIONS);
        titleFieldType.setIndexed(true);
        titleFieldType.setStored(true);

        Document document = new Document();
        document.add(new Field("title","I have a question about lucene", titleFieldType));
        writer.addDocument(document);

        document = new Document();
        document.add(new Field("title","Lucene: I have a really hard question about it", titleFieldType));
        writer.addDocument(document);

        writer.close();
    }

    private static void search(IndexSearcher indexSearcher, Query query) throws Exception {
        TopDocs topDocs = indexSearcher.search(query, 10);

        System.out.println("Total hits: " + topDocs.totalHits);

        for (ScoreDoc hit : topDocs.scoreDocs) {
            Document result = indexSearcher.doc(hit.doc);
            for (IndexableField field : result) {
                System.out.println(field.name() + ": " + field.stringValue() +  " - score: " + hit.score);
            }
        }
    }

Upvotes: 12
