Lucene - Search by a field sorting by another, falling back to a secondary field

Question

I'm looking to develop a simple search with the following fields

title
summary
popularity

If someone searches by say "ga", I'll search the title for partial matches (e.g. "The Game"), and sort those results by popularity.

If there are < 10 results, I want to fall back to summary. However, I want the summary matches to be lower down than any title matches, again sorted by popularity.

E.g. a search for "ga*"

"The Game"      | "About stuff"   | popularity = 3  | (title match)
"Gant charts"   | "great stats"   | popularity = 7  | (title match)
"Some Title"    | "mind the gap"  | popularity = 1  | (summary match)
"Another Title" | "blah games"    | popularity = 5  | (summary match)

I've written a simple implementation which executes 1 Lucene search, and if there are < 10 results, does a second search on synopsis. I then programmatically merge the results. However, this is not ideal, since there are duplicates I need to resolve and pagination will not work well. It would be better to do it all in 1 search if possible.

Is this possible, and if yes, how?

(I'm currently developing this using the Java Lucene jar)

This is my current attempt (written in Scala)

// Creating the indexes
private def addDoc(w:IndexWriter , clientContent: ClientContent, contentType:String):Unit ={
  val doc:Document = new Document()
  doc.add(new TextField("title", clientContent.title, Field.Store.YES))
  doc.add(new TextField("synopsis", clientContent.synopsis, Field.Store.YES))
  doc.add(new StringField("id", clientContent.id, Field.Store.YES))
  doc.add(new IntField("popularity", 100000 - clientContent.popularity.day, Field.Store.YES))
  doc.add(new StringField("contentType", contentType, Field.Store.YES))
  w.addDocument(doc);
}

def createIndex: Unit = {
  index = new RAMDirectory()

  val analyzer = new StandardAnalyzer(Version.LUCENE_43)
  val config = new IndexWriterConfig(Version.LUCENE_43, analyzer)
  val w = new IndexWriter(index, config)

  clientApplication.shows.list.foreach {addDoc(w, _, "Show")}

  w.close()
  reader = IndexReader.open(index)
  searcher = new IndexSearcher(reader)
}


// Searching by one field 
def dataSearch(queryString: String, section:String, limit:Int):Array[ScoreDoc] = {

  val collector = TopFieldCollector.create(
    new Sort(new SortField("popularity", SortField.Type.INT, true)),
    limit, false, true, true, false)

  val analyzer = new StandardAnalyzer(Version.LUCENE_43)
  val q = new QueryParser(Version.LUCENE_43, section, analyzer).parse(queryString+"*")
  searcher.search(q, collector)
  collector.topDocs().scoreDocs
}

// Searching for a query 
def search(queryString:String) = {
  println(s"search $queryString")

  val titleResults = dataSearch(queryString, "title", limit)

  if (titleResults.length < limit) {
    val synopsisResults = dataSearch(queryString, "synopsis", limit - titleResults.length)  
    createModel(titleResults  ++ synopsisResults)
  }
  else
    createModel(titleResults)
}

femtoRgon · Accepted Answer

You can sort by score first and popularity second, and give a large boost to the query on title. That alone would work, as long as the scores of all docs matching the title are equal, and the scores of the docs matching only the summary are equal:

Sort mySort = new Sort(SortField.FIELD_SCORE, new SortField("popularity", SortField.Type.INT, true));

Of course, they probably won't be equal. idf shouldn't be an issue as long as the boost is large enough, but... If the fields of different documents are different lengths, the lengthNorm will make scores unequal, unless you have disabled norms. The coord factor will cause problems, since docs that match for both fields will then have even higher scores than those just matching title. And if a matching term appears more than once in a field, then tf will be markedly different.
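
To illustrate why unequal scores break the secondary sort, here is a minimal plain-Java sketch. It is not Lucene code: Hit is a hypothetical stand-in for a search hit, and the score values are made up. The comparator mirrors the two-level sort above (score descending, then popularity descending), so popularity only matters when two scores are exactly equal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java model of a two-level sort; Hit is a hypothetical stand-in, not a Lucene class.
class TwoLevelSortDemo {
    static final class Hit {
        final String title;
        final float score;
        final int popularity;

        Hit(String title, float score, int popularity) {
            this.title = title;
            this.score = score;
            this.popularity = popularity;
        }
    }

    // Score descending first; popularity descending only breaks exact score ties.
    static final Comparator<Hit> SCORE_THEN_POPULARITY = (a, b) -> {
        int byScore = Float.compare(b.score, a.score);
        return byScore != 0 ? byScore : Integer.compare(b.popularity, a.popularity);
    };

    // Returns the titles in sorted order, leaving the input list untouched.
    static List<String> sortedTitles(List<Hit> hits) {
        List<Hit> copy = new ArrayList<>(hits);
        copy.sort(SCORE_THEN_POPULARITY);
        List<String> titles = new ArrayList<>();
        for (Hit h : copy) {
            titles.add(h.title);
        }
        return titles;
    }

    public static void main(String[] args) {
        // Equal scores: popularity decides the order.
        System.out.println(sortedTitles(Arrays.asList(
                new Hit("The Game", 2.0f, 3), new Hit("Gant charts", 2.0f, 7))));
        // Slightly different scores (made-up values): popularity is never consulted.
        System.out.println(sortedTitles(Arrays.asList(
                new Hit("The Game", 2.01f, 3), new Hit("Gant charts", 2.0f, 7))));
    }
}
```

With equal scores the more popular hit comes first; with even a tiny score difference the higher score wins regardless of popularity.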

So, you need a way to simplify the scoring and keep all the fancy Lucene relevancy scoring logic from getting in the way. You can get the scores to do what you want using ConstantScoreQuery and DisjunctionMaxQuery.

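// PrefixQuery terms are not analyzed, so lowercase queryString to match terms indexed by StandardAnalyzer
Query titleQuery = new ConstantScoreQuery(new PrefixQuery(new Term("title", queryString)));
titleQuery.setBoost(2);
// Use the field name the summary was indexed under ("synopsis" in the question's code)
Query summaryQuery = new ConstantScoreQuery(new PrefixQuery(new Term("summary", queryString)));
//Combine with a dismax, so matching both fields won't get a higher score than just the title
DisjunctionMaxQuery finalQuery = new DisjunctionMaxQuery(0);
finalQuery.add(titleQuery);
finalQuery.add(summaryQuery);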

Sort mySort = new Sort(
    SortField.FIELD_SCORE, 
    new SortField("popularity", SortField.Type.INT, true)
);

TopFieldCollector collector = TopFieldCollector.create(mySort, limit, false, true, true, false);

searcher.search(finalQuery, collector);

For the code you've provided, this will work, since you don't really need the query parser beyond constructing the prefixquery. You could just as well keep the parser though. ConstantScoreQuery is a wrapper query. You could wrap the query returned from QueryParser.parse just as easily:

QueryParser parser = new QueryParser(Version.LUCENE_43, "title", analyzer);

Query titleQuery = new ConstantScoreQuery(parser.parse(queryString + "*"));
titleQuery.setBoost(2);
Query summaryQuery = new ConstantScoreQuery(parser.parse("summary:" + queryString + "*"));
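
To see why the dismax combination yields exactly two score tiers, here is a small plain-Java model. It is a simplified sketch, not Lucene code: the class and method names are hypothetical, the scores are relative values (real Lucene scores are also scaled by query normalization), and the summed variant is only a rough stand-in for combining the clauses in a BooleanQuery.

```java
// Plain-Java model of the scoring tiers; not Lucene code.
class DismaxTiersDemo {
    static final float TITLE_SCORE = 2f;   // constant score with boost 2
    static final float SUMMARY_SCORE = 1f; // constant score with the default boost

    // Dismax with a tie breaker of 0: the score is the best single matching clause.
    static float dismaxScore(boolean titleMatch, boolean summaryMatch) {
        return Math.max(titleMatch ? TITLE_SCORE : 0f, summaryMatch ? SUMMARY_SCORE : 0f);
    }

    // Summing the clause scores instead (simplified stand-in for a BooleanQuery).
    static float summedScore(boolean titleMatch, boolean summaryMatch) {
        return (titleMatch ? TITLE_SCORE : 0f) + (summaryMatch ? SUMMARY_SCORE : 0f);
    }

    public static void main(String[] args) {
        System.out.println(dismaxScore(true, false)); // 2.0: title only
        System.out.println(dismaxScore(true, true));  // 2.0: both fields, same tier as title only
        System.out.println(dismaxScore(false, true)); // 1.0: summary only
        System.out.println(summedScore(true, true));  // 3.0: summing creates an unwanted third tier
    }
}
```

Because every title match lands on the same score and every summary-only match lands on a lower one, the popularity sort is free to order the docs within each tier.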
