how does Lucene/GoogleSearch merge postings lists?
this question always puzzled me. I know the basic starting concepts of search index: you have an inverted index, pointing from term to postings (doc) list; doc list is sorted by docID; in the case of multi-term query, the fetched postings lists from all terms are intersection'ed by jumping through skiplists pointers.
then it all stops there. for actual more details, I am completely lost. for example, if I search "American dog". then "dog" gives a long postings list (possibly billions, or even in Google's actual implementation, could be trillions of pages), and "America" gives a long postings list. then the questions:
- are both postings lists indeed sorted by docID ?
- ok, if sorted by docID, we can join them by jumping through the skipList. but what if the quality (PageRank score , in Google's case, or generally other quality scores native to the page) of the low-docID pages are low, and the actual very best pages are way back in the postings list? that way we'd have to really retrieve the ENTIRE interesection'ed list of the 2 postings lists, in order to return the top K best matching results.
this would be still the case even if we don't consider PageRank or other page-native scores, and simply count TF/IDF, since the TF/IDF sorting order is different from docID sorting order.
- if the postings list is not sorted by docID , for example it is sorted by PageRank order, then how do we find the intersection quickly, without retrieving the entirety of each postings list?
- in more complicated cases in Google Search, user's implicit query terms (such user's history, profile (gender, age, country) ) should be used in the scoring model together with attributes of the candidate page, to compute the final score. I understand that this scoring model is likely applied to only a much shorter list of candidates that survive the first stage of search. but again coming back to the first 3 questions, how do we quickly return the top K first stage search results?
thanks