teddy teddy
teddy teddy

Reputation: 3075

how does Lucene/GoogleSearch merge postings lists?

this question always puzzled me. I know the basic starting concepts of search index: you have an inverted index, pointing from term to postings (doc) list; doc list is sorted by docID; in the case of multi-term query, the fetched postings lists from all terms are intersection'ed by jumping through skiplists pointers.

then it all stops there. for actual more details, I am completely lost. for example, if I search "American dog". then "dog" gives a long postings list (possibly billions, or even in Google's actual implementation, could be trillions of pages), and "America" gives a long postings list. then the questions:

  1. are both postings lists indeed sorted by docID ?
  2. ok, if sorted by docID, we can join them by jumping through the skipList. but what if the quality (PageRank score , in Google's case, or generally other quality scores native to the page) of the low-docID pages are low, and the actual very best pages are way back in the postings list? that way we'd have to really retrieve the ENTIRE interesection'ed list of the 2 postings lists, in order to return the top K best matching results. this would be still the case even if we don't consider PageRank or other page-native scores, and simply count TF/IDF, since the TF/IDF sorting order is different from docID sorting order.
  3. if the postings list is not sorted by docID , for example it is sorted by PageRank order, then how do we find the intersection quickly, without retrieving the entirety of each postings list?
  4. in more complicated cases in Google Search, user's implicit query terms (such user's history, profile (gender, age, country) ) should be used in the scoring model together with attributes of the candidate page, to compute the final score. I understand that this scoring model is likely applied to only a much shorter list of candidates that survive the first stage of search. but again coming back to the first 3 questions, how do we quickly return the top K first stage search results?

thanks

Upvotes: 1

Views: 170

Answers (0)

Related Questions