Reputation: 357
The relevance model just estimates the relevance feedback based on feedback documents. In this case, the relevance model would have a higher probability of picking common words as its feedback terms. Thus I assumed the performance of the relevance model wouldn't be as good compared to the other two models. However, I learned that all those models perform pretty well. What would be the reason for that?
Upvotes: 0
Views: 139
Reputation: 3750
"In contrast, the relevance model just estimates the relevance feedback based on feedback documents. In this case, the relevance model would have a higher probability of picking common words as its feedback terms"
That's a common perception which isn't necessarily true. To be more specific, recall that the estimation equation of the relevance model looks like:
P(w|R) = \sum_{D \in \text{Top-}K} P(w|D) \prod_{q \in Q} P(q|D)
which in simple English means that -- to compute the weight of a term w in the set of top-K docs, you iterate over each document D in the top-K and multiply P(w|D) by the similarity score of Q with D (this is the value \prod_{q \in Q} P(q|D)). Now, the idf factor is hidden inside the expression P(w|D).
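The weighting scheme above can be sketched in a few lines. This is a toy illustration, not a production implementation: the documents and query are hypothetical, and P(w|D) is left as an unsmoothed maximum-likelihood estimate to keep the example short (smoothing is discussed below).

```python
from collections import Counter

def p_w_given_d(word, doc_counts, doc_len):
    """Max-likelihood P(w|D): term count over document length."""
    return doc_counts[word] / doc_len

def relevance_model(query, top_k_docs):
    """P(w|R) = sum over top-K docs of P(w|D) * prod_{q in Q} P(q|D)."""
    weights = Counter()
    for doc in top_k_docs:
        counts = Counter(doc)
        length = len(doc)
        # The query likelihood acts as the document's similarity score.
        score = 1.0
        for q in query:
            score *= p_w_given_d(q, counts, length)
        # Every term in the document gets some of that document's mass.
        for w in counts:
            weights[w] += p_w_given_d(w, counts, length) * score
    return weights

# Hypothetical feedback documents and query:
docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
]
rm = relevance_model(["cat"], docs)
```

Note how terms that co-occur with the query term in the feedback documents receive non-zero weight, and how, without the idf component inside P(w|D), a stopword like "the" dominates the estimate.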
Following the standard language model paradigm (Jelinek-Mercer or Dirichlet), this isn't just a simple max-likelihood estimate but is rather a collection smoothed version, e.g., for Jelinek-Mercer, this is:
P(w|D) = log(1 + lambda/(1-lambda) * count(w,D)/length(D) * collection_size/cf(w))
which is nothing but a linear-combination-based generalization of tf*idf -- the second component, collection_size/cf(w), specifically denotes the inverse collection frequency.
So, this expression of P(w|D) ensures that terms with higher idf values tend to get higher weights in the relevance model estimation. In addition to high idf weights, such terms should also have a high level of co-occurrence with the query terms, due to the product of P(w|D) with P(q|D).
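The smoothed score above can also be written out directly. A minimal sketch, assuming hypothetical toy counts; `lam` is the Jelinek-Mercer smoothing weight, and `collection_size`/`cf_w` are the total collection term count and the term's collection frequency:

```python
import math

def jm_score(count_w_d, doc_len, cf_w, collection_size, lam=0.5):
    """Jelinek-Mercer smoothed term weight:
    log(1 + lam/(1-lam) * count(w,D)/length(D) * collection_size/cf(w))
    """
    return math.log(
        1
        + lam / (1 - lam)
        * (count_w_d / doc_len)
        * (collection_size / cf_w)
    )

# Two terms with identical within-document counts, but different
# collection frequencies (all numbers hypothetical):
rare = jm_score(count_w_d=2, doc_len=100, cf_w=10, collection_size=100_000)
common = jm_score(count_w_d=2, doc_len=100, cf_w=5_000, collection_size=100_000)
```

With equal tf, the rare (high-idf) term gets the larger weight, which is exactly why the relevance model does not simply drift toward common words.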
Upvotes: 0