Aly

Reputation: 935

Sorting by exact match on a multi-valued field

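I'm parsing JMdict (an XML file described by a DTD) and indexing it with Lucene. Queries search the keb, reb, and gloss fields, and each of those fields can hold several values per entry. I'm adding the fields to the Documents like so: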

Document | keb    | reb        | gloss
A        | 明日   | あした     | tomorrow
         |        | あす       | near future
         |        | みょうにち |
--------
B        | 明日葉 | あしたば   | Angelica keiskei (species of angelica, a herb of the parsley family)
         | 鹹草   | あしたぐさ |
         |        | アシタバ   |

My problem is that a WildcardQuery on keb for 明日* ranks document B above document A, even though document A has a keb that matches the query exactly. Right now I sort only by SortField.FIELD_SCORE. How do I prioritize documents by the length of the keb value that matches the query? Note that this is not the same as the shortest keb in the document: both A and B have a keb of length 2 (明日 and 鹹草), but A is the only one where the keb containing 明日 has length 2.

// TODO sort KEB searches by
// 1. length of keb with exact match
// 2. commonality of entry
// 3. Lucene score
private static final Sort KEB_SORT =
        new Sort(SortField.FIELD_SCORE);
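
To make the ordering in that TODO concrete, here is a minimal plain-Java sketch of the comparison I have in mind. It is not Lucene API: Entry, its common flag, matchedKebLength, and rank are hypothetical names invented for illustration, and the scores are made-up numbers. It only shows the ranking rule, not how to get Lucene to apply it.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class KebRanking {
    /** Hypothetical stand-in for one search hit; not a Lucene type. */
    record Entry(String id, List<String> kebs, boolean common, double score) {}

    /**
     * Length (in code points) of the shortest keb that starts with the prefix,
     * or Integer.MAX_VALUE when no keb in the entry matches.
     */
    static int matchedKebLength(Entry e, String prefix) {
        int best = Integer.MAX_VALUE;
        for (String keb : e.kebs()) {
            if (keb.startsWith(prefix)) {
                best = Math.min(best, keb.codePointCount(0, keb.length()));
            }
        }
        return best;
    }

    /** Orders by matched keb length, then common entries first, then higher score. */
    static List<Entry> rank(List<Entry> hits, String prefix) {
        Comparator<Entry> byMatchedLength =
                Comparator.comparingInt(e -> matchedKebLength(e, prefix));
        Comparator<Entry> byCommon = Comparator.comparing(Entry::common);
        Comparator<Entry> byScore = Comparator.comparingDouble(Entry::score);

        List<Entry> sorted = new ArrayList<>(hits);
        sorted.sort(byMatchedLength
                .thenComparing(byCommon.reversed())   // true (common) sorts first
                .thenComparing(byScore.reversed()));  // higher score sorts first
        return sorted;
    }

    public static void main(String[] args) {
        // Made-up scores: B scores higher, as in the problem described above.
        Entry a = new Entry("A", List.of("明日"), true, 1.0);
        Entry b = new Entry("B", List.of("明日葉", "鹹草"), false, 2.0);
        for (Entry e : rank(List.of(b, a), "明日")) {
            System.out.println(e.id()); // prints A, then B
        }
    }
}
```

Document B's 鹹草 is ignored here because it does not match the prefix, so B is ranked by 明日葉 (length 3) and lands after A (length 2) despite its higher score.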

If it matters, my application falls back from gloss to keb for entries that use Latin characters, for example Z検定 or rkgk. I first run a small search (a single hit) to check that the query matches at least one entry, and if it does I return that partial result and fetch the remaining hits later.

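// initially: probe with a single hit to check that the query matches anything
TopDocs td = is.search(wildcardQuery, 1, sortForTheQuery);
if (td.totalHits.value() > 0L) return td;

// later: fetch the remaining hits, starting after the last hit already returned
TopDocs rest = is.searchAfter(
    initialSearch.scoreDocs[initialSearch.scoreDocs.length - 1],
    chain.luceneQuery(),
    (int) initialSearch.totalHits.value()
);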

Upvotes: 1

Views: 49

Answers (0)
