Reputation: 1386
I am writing a search algorithm for text documents. The algorithm takes a set of keywords and returns those documents which closely match the provided set of keywords -- like a web search engine.
Right now I have implemented the TF-IDF metric for measuring document relevance, but that metric doesn't consider term concurrence, so it will rate a document that has the terms close to each other equal to one that has them at opposite sides of the document, even though the document with the terms being next to each other is probably the more relevant one.
So I want to find where the query terms occur the closest in a document. This is complicated, because there can be any number of terms involved, and the weighting factors that determine how good a particular placement is are somewhat arbitrary. As a first pass at the problem, here's a bruteforce solution for this:
Given a set of search terms {a, b, c}
and a document [a1, _, _, b1, _, c1, b2, _, _, a2, _, a3]
:
a
term at a1
, a2
and a3
; for each of those:b
term at b1
and b2
; for each of those:c
term at c1
(and nowhere else, because that's the only occurrence).{aN, bN, cN}
, measure sum([abs(first, second) for first, second in itertools.product(positions, positions)])
; lower score is better.This works, but requires O(n^t * t^2)
runtime for each document to be scored, where n
is the number of terms in the document, and t
is the number of search terms. This becomes infeasible rather quickly, and reminds me of NP-hard problems like the knapsack problem.
What options are there for getting the value of this metric without trying all the combinations? Approximate solutions are okay, as is storing some additional information per document (as long as it's not the precomputed bruteforce values).
Upvotes: 0
Views: 50
Reputation: 3750
There're standard feedback models that does this... the complexity is still linear in |V|, i.e., O(|VQ|) (V is the vocabulary of top-k documents, and |Q| is the number of query terms which is practically a small number). After you estimate the co-occurrence likelihoods, you simply rerank your initial retrieved list.
The standard feedback model to use is the relevance model.
Upvotes: 0