Reputation: 1151
This is an interview question and not a homework.
"You have N documents, where N is very large. Each document has a set of words lets say w1,w2..wm where m might differ for each document. Now you are given a list to K words lets say q1,q2…qk. Write an algorithm to print the list of document which have the K words in them."
Now, I could figure out solutions using Hashing and trie. But the guy who posted the question had also written that the interviewer wanted a solution using B-tree.
I am not really able to figure out how to use a B-Tree for this and how efficient would that be. Can somebody please help?
Upvotes: 3
Views: 448
Reputation: 56
B-Tree is preferred over Trie if our dataset is stored on media with slow random access, for example on conventional hard drives. The interviewer's note that N is very large might imply that it's simply large enough to not fit in memory and should be placed on disk.
As noted in comments: when the data is really huge and it is stored on a disk, the efficiency of a data structure depends more on the number of disk block accesses, not the total amount of all operations. B-Tree contains many records in one node (which can be considered a "data block"), thus requires significantly fewer block accesses than Trie does.
That is exactly the same reason why most DBs store their indexes in a B-Tree. They need fast search through index located on conventional hard drive. Actually, your problem can be solved by putting your (word-documentId) pairs in DB table and creating an index on word column or the entire pair.
Upvotes: 1
Reputation: 12592
You can try a ternary trie. It doesn't take so much space. You can also look for a Kart-trie. It uses a key and 2 leafs:http://code.dogmap.org/kart/.
Upvotes: 0