Reputation: 35
I have a schema that has a field of type array<string>:
field titles type array<string> {
indexing: index | summary | attribute
index: enable-bm25
attribute: fast-search
}
Say titles contains N titles: Title 1, Title 2, ..., Title N. I would like to rank documents by the maximum bm25 score between the query and any single title in titles. In other words, I would like the rank score of the document to be equal to max(bm25('Title 1'), bm25('Title 2'), ..., bm25('Title N')).
Just setting the ranking expression to bm25(titles) does not achieve what I want. For example, given a query Q with the terms term 1, term 2, term 3 and two documents:
{"titles": [".*term 1.*", ".*term 1.*", ".*term 1.*", ".*term 1.*"]}
{"titles": [".*term 1 term 2 term 3.*", STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS, STRING_WITH_NONE_OF_THE_TERMS]}
Using bm25(titles) as the ranking expression ranks doc 1 higher than doc 2. I assume that's because a query term appears in every title of doc 1, while in doc 2 the query terms appear in only one title. I want doc 2 to be ranked higher, as it contains a title that is an almost complete match for the query: max(bm25) should be higher for doc 2, even though the average/sum over all titles might be higher for doc 1.
Is there a way I can achieve that in Vespa?
Upvotes: 2
Views: 440
Reputation: 3184
Thanks for the detailed question. Vespa does not support this for the bm25 rank feature; it is computed over all elements of the field.
You can achieve similar functionality using rank-features designed for multi-valued fields. See https://docs.vespa.ai/en/searching-multi-valued-fields.html, https://docs.vespa.ai/en/reference/rank-features.html#features-for-indexed-multivalue-string-fields.
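As a rough sketch of what that could look like, assuming the default output of elementSimilarity (which, per the rank feature reference linked above, scores each element separately and aggregates with max), a rank profile along these lines may get close to what you describe. The profile name max_title_match is made up for illustration, and the score is a per-element similarity, not BM25, so the absolute values will differ from bm25(titles):

```
# Hypothetical rank profile; goes inside the schema that defines the titles field.
rank-profile max_title_match inherits default {
    first-phase {
        # Per-element similarity between the query and each title.
        # The default output is assumed to take the max over elements,
        # so one closely matching title can dominate the score.
        expression: elementSimilarity(titles)
    }
}
```

You would then select it at query time with the ranking.profile query parameter. It is worth checking the resulting scores on your two example documents before relying on it.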
Unrelated: note that unless you want to group on this field, you don't want to use attribute, as it puts everything in memory:
field titles type array<string> {
indexing: index | summary
index: enable-bm25
}
Upvotes: 2