Reputation: 8500
This is actually a real problem I'm working on, but for simplicity, let's pretend I'm Google.
Say the user searches for "nanoscale tupperware". There aren't very many pages with both words... only about 3k. But there are ~2 million pages with "nanoscale" and ~4 million with "tupperware". Still, Google finds the 3k for me in 0.3 seconds.
How does it do it?
The only algorithm I'm aware of is to get the documents for "nanoscale", get the documents for "tupperware", and then do a list merge. But that's O(N + M) — around 6,000,000 operations here — which seems a little slow. Particularly if I'm running it on a desktop instead of an uber-fast cluster.
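For concreteness, here's a minimal sketch of that O(N + M) list merge, assuming each term's document list is kept sorted by document ID (the function name and sample IDs are made up for illustration):

```python
def merge_intersect(a, b):
    """Intersect two sorted lists of doc IDs with two pointers.
    Every comparison advances at least one pointer, so the total
    work is O(len(a) + len(b))."""
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

print(merge_intersect([1, 3, 5, 9], [2, 3, 9, 12]))  # [3, 9]
```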
So is that actually what Google is doing, and their speed is due mostly to the fact that they're running this expensive computation on their massive distributed cluster?
Or is there a better algorithm that I'm not aware of? Wikipedia and Google aren't turning up anything for me.
Edit:
Since people seem to be focusing on the Google aspect of my question, I guess I'll restate it in the actual terms.
I have several very large (millions of items) indexes implemented as key/value pairs. Keys are simple words, values are sets of documents. A common use case is to get the intersection of results of several searches on different indexes: the pain point is computing the intersection of the document sets.
I can re-implement my indexes however I want - it's mostly an academic project at this point.
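Since the values are already sets, one cheap baseline (a sketch, not a tuned implementation — the function name and sample data are hypothetical) is to intersect hash sets starting from the smallest one, so the work is driven by the rarest term rather than the most common one:

```python
def intersect_indexes(posting_sets):
    """Intersect document-ID sets, smallest first. Each membership
    probe is O(1) on average, so total work is roughly
    O(len(smallest) * number_of_sets)."""
    ordered = sorted(posting_sets, key=len)
    result = set(ordered[0])
    for s in ordered[1:]:
        result &= s
        if not result:
            break  # early exit: intersection is already empty
    return result

nanoscale = {1, 4, 7, 9, 12}
tupperware = {2, 4, 9, 30}
print(sorted(intersect_indexes([nanoscale, tupperware])))  # [4, 9]
```

Note this still touches every element of the smallest set, but for a query like the one above that's ~3k-scale work against the rarest word, not 6M-scale work against both lists.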
Upvotes: 6
Views: 1220
Reputation: 101149
The way you're describing it, you already have an inverted index, with a posting list for each term (the list of documents containing it). I'm not aware of a better solution than merge joining the posting lists for each term, and to the best of my knowledge, that's what fulltext indexing solutions like Lucene do. There are a couple of obvious optimisations you can make here, though: keep each posting list sorted by document ID and drive the intersection from the rarest term, so the cost tracks the shortest list; and use skip pointers (or binary search within the longer list) so the merge can jump over long runs of non-matching documents instead of scanning them one by one.
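To illustrate the skipping idea (this is a simplified sketch, not Lucene's actual skip-list implementation): drive the intersection from the shorter posting list and binary-search the longer one from the last match position, giving roughly O(min · log max) instead of O(N + M):

```python
from bisect import bisect_left

def skip_intersect(short, long):
    """Intersect two sorted doc-ID lists. For each ID in the shorter
    list, binary-search the longer list starting from where the last
    search left off, skipping runs of non-matching documents."""
    out, pos = [], 0
    for doc in short:
        pos = bisect_left(long, doc, pos)
        if pos == len(long):
            break  # longer list exhausted
        if long[pos] == doc:
            out.append(doc)
            pos += 1
    return out

# 3-element list vs 50-element list: ~3 * log2(50) comparisons.
print(skip_intersect([2, 9, 50], list(range(0, 100, 2))))  # [2, 50]
```

With 3k matches against a 4M-document posting list, that's thousands of binary searches rather than millions of sequential comparisons, which is the same effect skip pointers buy you.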
Upvotes: 4
Reputation: 69342
What you're describing is called n-grams.
Google ranks its results with an algorithm called PageRank, which it computes offline using MapReduce.
All of these topics have been discussed at length on Stack Overflow in the past. It should be fairly easy to look them up.
This probably doesn't help you a whole bunch since you likely don't have an enormous distributed system to run MapReduce on, but since you didn't really give us any details on what you're trying to index, it's difficult to suggest something suited to your problem.
Upvotes: -1