sorting by exact match of multi-field

Question

I'm parsing JMdict (XML with a DTD) and running queries against the keb, reb, and gloss fields. Each Document gets multiple values per field, like so:

Document	`keb`	`reb`	`gloss`
A	明日	あした	tomorrow
		あす	near future
		みょうにち
--------
B	明日葉	あしたば	Angelica keiskei (species of angelica, a herb of the parsley family)
	鹹草	あしたぐさ
		アシタバ

My problem is that a WildcardQuery on keb for 明日* ranks document B above document A, even though document A has a keb that is an exact match. Right now I only sort by SortField.FIELD_SCORE. How do I prioritize documents by the length of the keb entry that matches the query? Note that this is not the same as the shortest keb entry in the document: both A and B have a keb of length 2, but A is the only one where the keb containing 明日 has length 2.

// TODO sort KEB searches by
// 1. length of keb with exact match
// 2. commonality of entry
// 3. Lucene score
private static final Sort KEB_SORT =
        new Sort(SortField.FIELD_SCORE);
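
For illustration, the ordering described in the TODO can be sketched in plain Java, without Lucene, as a re-ranking step over hits that have already been loaded. This is only a sketch under assumptions: the `Hit` class, its fields, and the `rank` helper are hypothetical names that do not come from the question or from Lucene, and the "commonality" criterion is left out because the question does not say how it is stored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class KebRanking {
    /** Hypothetical in-memory view of one hit: all of its keb values plus its score. */
    static final class Hit {
        final String id;
        final List<String> kebs;
        final float score;

        Hit(String id, List<String> kebs, float score) {
            this.id = id;
            this.kebs = kebs;
            this.score = score;
        }
    }

    /**
     * Length (in code points) of the shortest keb that starts with the prefix.
     * kebs that do not match are ignored; returns Integer.MAX_VALUE if none match.
     */
    static int shortestMatchLength(Hit hit, String prefix) {
        int best = Integer.MAX_VALUE;
        for (String keb : hit.kebs) {
            if (keb.startsWith(prefix)) {
                best = Math.min(best, keb.codePointCount(0, keb.length()));
            }
        }
        return best;
    }

    /** Sorts by shortest matching keb first, then by score, highest first. */
    static List<Hit> rank(List<Hit> hits, String prefix) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator
                .comparingInt((Hit h) -> shortestMatchLength(h, prefix))
                .thenComparing(Comparator.comparingDouble((Hit h) -> h.score).reversed()));
        return sorted;
    }
}
```

With the two documents above and the prefix 明日, document A gets a match length of 2 and document B gets 3, because 鹹草 does not match the prefix and is ignored. A therefore sorts first even when B has the higher score. Doing this after the search only works if all candidate hits are fetched, so it does not replace a proper index-time or Sort-based solution.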

If it matters, my application falls back from gloss to keb for queries that use Latin characters, for example Z検定 or rkgk. I first run a small search to check that there is at least one hit, and return that partial result if there is.

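// initially: probe with a single hit to see whether the query matches anything
TopDocs td = is.search(wildcardQuery, 1, sortForTheQuery);
if (td.totalHits.value() > 0L) return td;

// later: fetch the hits after the last one already returned
// (initialSearch is the TopDocs returned by the probe above)
TopDocs rest = is.searchAfter(
    initialSearch.scoreDocs[initialSearch.scoreDocs.length - 1],
    chain.luceneQuery(),
    (int) initialSearch.totalHits.value()
);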

Answers (0)