Reputation: 15
This is a very basic and perhaps silly question. I have read that, in order to avoid prohibitively large relevance assessments in TREC competitions (reference), the top-ranked documents returned by the participating systems are pooled to create the set of documents for relevance assessment. However, my doubt is this:
If the majority of the systems use a common model, or similar models with roughly the same parameters (for example, several systems using LSA with the rank reduced to 100, 120, 150, 105, etc.), then there are two problems. First, merging such results might not really surface the documents relevant to each query, because the returned documents may overlap heavily. Second, the set of documents to be assessed is biased towards the models used by the participating systems, so the relevance judgements will not really be method-agnostic.
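To make the worry concrete, here is a toy Python sketch (the run names, pool depth, and document IDs are all made up): four near-identical runs are pooled at depth k, and the pooled union ends up barely larger than any single run.

```python
# Toy illustration of depth-k pooling over several (hypothetical) runs;
# the run names, depth, and document IDs are all invented for the example.
from itertools import chain

k = 3  # pool depth

# Ranked lists (for one query) from four very similar "LSA-style" systems.
runs = {
    "lsa_rank100": ["d1", "d2", "d3", "d4"],
    "lsa_rank120": ["d1", "d3", "d2", "d5"],
    "lsa_rank150": ["d2", "d1", "d3", "d6"],
    "lsa_rank105": ["d1", "d2", "d4", "d3"],
}

# The pool is the union of each run's top-k documents.
pool = set(chain.from_iterable(ranking[:k] for ranking in runs.values()))

print(sorted(pool))                                    # ['d1', 'd2', 'd3', 'd4']
print(len(pool), "unique documents from", k * len(runs), "pooled entries")
```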
I know I am missing something here, and if anyone could guide me to the missing link it would be really helpful!
Upvotes: 1
Views: 416
Reputation: 241
Yes, those problems are possible, but tend not to matter in practice if the set of runs is diverse enough and the pool depth is deep enough. Justin Zobel examined this problem way back in 1998: https://ir.webis.de/anthology/1998.sigirconf_conference-98.38/
The TREC overview papers from TREC-7 and TREC-8 also give lots of details about the pools created for the early TREC ad hoc collections (TREC proceedings papers are posted in the Publications section of the TREC web site, trec.nist.gov). We have also documented cases where pooling was not successful. See "Bias and the Limits of Pooling": https://www.nist.gov/publications/bias-and-limits-pooling-large-collections
Building large test collections that are fair, general-purpose, and affordable is an ongoing research problem.
Ellen Voorhees
NIST
Upvotes: 3
Reputation: 3740
You are correct. Pooling has its own problems, and we have to live with them.
There are, however, ways of making the pooling process less biased towards any specific set of retrieval models.
Using a diverse set of retrieval models and different retrieval settings (e.g. using the title alone, or the title and the description, as queries) often helps reduce the overlap in the retrieved sets of documents. The overlap is not always a bad thing either, because retrieving a document in multiple lists (corresponding to different settings or retrieval models) may actually reinforce the case for including that document in the pool.
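As a rough sketch of that reinforcement idea (the runs and documents below are hypothetical), one can simply count how many runs place each document in their top-k:

```python
from collections import Counter

# Count how many runs (different models or query formulations) place each
# document in their top-k. A document retrieved by many runs is a stronger
# candidate for inclusion in the pool.
def contribution_counts(runs, k):
    counts = Counter()
    for ranking in runs.values():
        for doc in set(ranking[:k]):   # each run counts a document at most once
            counts[doc] += 1
    return counts

runs = {
    "bm25_title":      ["d1", "d7", "d2"],
    "bm25_title_desc": ["d7", "d1", "d3"],
    "lsa_rank100":     ["d1", "d2", "d4"],
}
# d1 is retrieved by all three runs, so it gets the highest count (3).
print(contribution_counts(runs, k=3).most_common())
```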
Another approach followed in TREC was to encourage participants to contribute manually post-processed runs, so that some of the documents shown to the assessors have undergone manual filtering rather than being outputs of purely automated algorithms.
While it is true that the top-retrieved set is a function of a specific retrieval model, the idea behind pooling is that, with sufficient depth (say depth-100), it is highly unlikely that a truly relevant document would fail to appear in the top 100 of every contributing model. So, the more settings (models and query formulation strategies) one uses and the greater the pool depth, the lower the probability of missing a truly relevant document.
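Here is a toy-sized sketch of that argument (the runs, the relevant set, and the depths are all invented; real pools use depths like 100 over many more runs): as the depth grows and the runs are more diverse, the depth-k pool covers more and more of the relevant set.

```python
# Toy sketch of the depth argument: the pool is the union of every run's
# top-k, and its coverage of the (normally unknown) relevant set grows
# with the depth k and the diversity of the contributing runs.

def depth_k_pool(runs, k):
    """Union of the top-k documents from every run."""
    return {doc for ranking in runs.values() for doc in ranking[:k]}

def coverage(runs, k, relevant):
    """Fraction of the relevant documents captured by a depth-k pool."""
    return len(depth_k_pool(runs, k) & relevant) / len(relevant)

# Hypothetical ranked lists from models with different characteristics.
runs = {
    "bm25_title":      ["d1", "d7", "d2", "d9", "d3"],
    "bm25_title_desc": ["d7", "d1", "d3", "d2", "d8"],
    "lsa_rank100":     ["d1", "d2", "d4", "d3", "d7"],
    "neural_ranker":   ["d9", "d7", "d1", "d8", "d5"],
}
relevant = {"d1", "d3", "d7", "d8", "d9"}   # unknown to the organisers in practice

for k in (1, 3, 5):
    print(f"depth {k}: coverage = {coverage(runs, k, relevant):.2f}")
# depth 1: coverage = 0.60
# depth 3: coverage = 0.80
# depth 5: coverage = 1.00
```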
However, it is certainly possible to later extend the assessment pool for a retrieval model whose characteristics are completely different from those of the models used to construct the initial pool.
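In that case, only the new model's top-k documents that were never judged before need to go to the assessors. A minimal sketch, assuming the existing judgements are available as a document-to-label mapping (all identifiers below are made up):

```python
# Sketch of extending an existing pool with a run from a new, very different
# model: only its top-k documents that have never been judged are assessed.

def pool_extension(existing_judgements, new_run, k):
    """Top-k documents of new_run that have no relevance judgement yet."""
    judged = set(existing_judgements)            # doc ids with existing labels
    return [doc for doc in new_run[:k] if doc not in judged]

existing_judgements = {"d1": 1, "d2": 0, "d3": 1, "d7": 1}   # made-up qrels
new_run = ["d7", "d10", "d1", "d11", "d3"]                   # hypothetical run

print(pool_extension(existing_judgements, new_run, k=5))     # ['d10', 'd11']
```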
Upvotes: 1