I'm parsing JMdict (XML DTD) and running queries against the `keb`, `reb`, and `gloss` fields. I'm adding fields to the `Document`s like so:
| Document | keb | reb | gloss |
|---|---|---|---|
| A | 明日 | あした | tomorrow |
| | | あす | near future |
| | | みょうにち | |
| B | 明日葉 | あしたば | Angelica keiskei (species of angelica, a herb of the parsley family) |
| | 鹹草 | あしたぐさ | |
| | | アシタバ | |
My problem is that a `WildcardQuery` on `keb` for `明日*` ranks document B above document A, even though document A has a `keb` that is an exact match. All I'm sorting by right now is `FIELD_SCORE`. How do I prioritize documents by the length of the `keb` entry that matches the query? This is not the same as the shortest `keb` entry: both document A and document B have entries of length 2, but document A is the only one where the entry containing `明日` has length 2.
// TODO sort KEB searches by
// 1. length of keb with exact match
// 2. commonality of entry
// 3. Lucene score
private static final Sort KEB_SORT =
new Sort(SortField.FIELD_SCORE);
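To make the first TODO item concrete, here is a plain-Java sketch of the ordering I'm after. It does not use Lucene at all: the class `KebLengthRanking` and its methods are made up for illustration, the map stands in for the indexed documents, and the wildcard `明日*` is treated as a simple prefix match. It only shows the ranking I want, not how to get a `Sort` to produce it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class KebLengthRanking {

    // Length (in code points) of the shortest keb that starts with the prefix,
    // or Integer.MAX_VALUE if no keb in the list matches.
    static int shortestMatchingKebLength(List<String> kebs, String prefix) {
        return kebs.stream()
                .filter(k -> k.startsWith(prefix))
                .mapToInt(k -> k.codePointCount(0, k.length()))
                .min()
                .orElse(Integer.MAX_VALUE);
    }

    // Returns the ids of documents with at least one matching keb,
    // ordered by the length of their shortest matching keb (ascending).
    static List<String> rank(Map<String, List<String>> kebsByDoc, String prefix) {
        return kebsByDoc.entrySet().stream()
                .filter(e -> shortestMatchingKebLength(e.getValue(), prefix) != Integer.MAX_VALUE)
                .sorted((a, b) -> Integer.compare(
                        shortestMatchingKebLength(a.getValue(), prefix),
                        shortestMatchingKebLength(b.getValue(), prefix)))
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the two documents in the table above.
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("B", List.of("明日葉", "鹹草"));
        docs.put("A", List.of("明日"));

        // 鹹草 has length 2 but does not match, so B is ranked by 明日葉 (length 3).
        System.out.println(rank(docs, "明日")); // prints [A, B]
    }
}
```

Note that B's `鹹草` is ignored because it doesn't match the query, which is exactly why a "shortest `keb` in the document" sort key is not enough.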
If it matters, in my application I fall back from `gloss` to `keb` for entries that use Latin characters, for example `Z検定` or `rkgk`. I first run a small search to make sure there is a nonzero number of hits, and return the partial search if it wins.
// initially
TopDocs td = is.search(wildcardQuery, 1, sortForTheQuery);
if(td.totalHits.value() > 0L) return td;
// later
TopDocs rest = is.searchAfter(
initialSearch.scoreDocs[initialSearch.scoreDocs.length - 1],
chain.luceneQuery(),
(int) initialSearch.totalHits.value()
);