Reputation: 18552
I have a database with approximately 1 million names and addresses. This database should be exposed for a "Google Suggest"-like instant search on a web-page. I'm looking for an efficient algorithm/data structure which can help me achieve this.
What makes this more difficult than just using a Trie or a Generalized Suffix Tree, is that it must support queries with some of the names left out. For instance when a user types "Elvis Pr", "Elvis Aaron Presley" should be suggested.
I'm hoping to get the whole index in memory (I have about 4GB of RAM to be used for this).
The application is written in Java, so links to Java-based libraries are considered extra helpful. I've been looking a bit at Lucene and MG4J, but I haven't figured out which type of indexing I could use for my problem.
Upvotes: 3
Views: 2121
Reputation: 452
Have a look at Cleo, it looks you have a pretty similar use case. They use a Bloom filter to search prefixes and a Forward index to remove false positives.
Upvotes: 1
Reputation: 8553
Solr provides a full feature autosuggest feature OOTB. This link provides some background of generating from popular queries. But you could easily adapt it to build some a priori knowledge such as the scenario you describe.
Upvotes: 1
Reputation: 51226
Perhaps what you actually want is a search in which each word typed by the user must appear as a prefix of some word in a contact. This is a bit easier and faster than general substring search.
Upvotes: 2
Reputation: 8609
You could try using a bk-tree with a string distance metric such as levenshtein see also http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.
edit: Inventing a distance metric is hard, but I just happen to know one that you could use called structural entropic distance which is based on the information difference. It works as follows:
take two strings x = "Elvis Pr" and y = "Elvis Aaron Presley"
for each work out the multiset of uni- and bi-grams:
x = {e, l, v, i, s, _, p, r, el, lv, vi, is, s_, _p, pr}
y = {ex3, lx2, v, i, sx2, _x2, ax2, rx2, o, n, p, y, el, lv, vi, is, s_, _a, aa, ar, ro, on, n_, _p, pr, re, es, sl, le, ey}
now for those terms that are in both
{e, l, v, i, s, _, p, r, el, lv, vi, is, s_, _p, pr}
calculate the product (f_x(t) / (f_x(t) + f_y(t)))^{f_x(t)/2} * (f_y(t) / (f_x(t) + f_y(t)))^{f_y(t)/2}
so
e = ((1/15) / (1/15 + 3/37))^(1/30) * ((3/37) / (1/15 + 3/37))^(3/74)
l = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
v = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
i = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
_ = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
p = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
r = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
el = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
lv = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
vi = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
is = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s_ = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
_p = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
pr = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
Multiply all these together and you should get a number in the range [0.5, 1] so you can scale this more usefully to the range [0,1] by multiplying by 2 and subtracting 1.
However this is not a discrete distance metric so you will have to use another metric index such as the vp-tree
Upvotes: 1