Algorithm for fast substring search in a contact database

I have a database with approximately 1 million names and addresses. This database should be exposed for a "Google Suggest"-like instant search on a web-page. I'm looking for an efficient algorithm/data structure which can help me achieve this.

What makes this more difficult than just using a Trie or a Generalized Suffix Tree, is that it must support queries with some of the names left out. For instance when a user types "Elvis Pr", "Elvis Aaron Presley" should be suggested.

I'm hoping to get the whole index in memory (I have about 4GB of RAM to be used for this).

The application is written in Java, so links to Java-based libraries are considered extra helpful. I've been looking a bit at Lucene and MG4J, but I haven't figured out which type of indexing I could use for my problem.

Upvotes: 3

Answers (4)

zdepablo

Reputation: 452

Have a look at Cleo, it looks you have a pretty similar use case. They use a Bloom filter to search prefixes and a Forward index to remove false positives.

Upvotes: 1

Mikos

Reputation: 8553

Solr provides a full feature autosuggest feature OOTB. This link provides some background of generating from popular queries. But you could easily adapt it to build some a priori knowledge such as the scenario you describe.

Upvotes: 1

j_random_hacker

Reputation: 51226

Perhaps what you actually want is a search in which each word typed by the user must appear as a prefix of some word in a contact. This is a bit easier and faster than general substring search.

Build a single sorted array of all words belonging to any contact, and store a "contact ID" field alongside each word (e.g. [Aaron/1, Aleksander/2, Blomskøld/2, Elvis/1, Presley/1]).
Separately for each word typed by the user, use binary search to find the range of names starting with that word (this will necessarily be a contiguous range of indices within the array). Since the user is usually adjusting only one word with each keystroke, you will only need to recompute one of these ranges with each keystroke -- and in fact even this recomputation step can be done more efficiently in the common case of typing an additional letter, since this can only narrow the range of matching words.
Finally, intersect the sets of contact IDs to produce the list of possibilities. To show the possibilities, you will need a second array, indexed by contact ID and containing the full names.

Upvotes: 2

Robert

Reputation: 8609

You could try using a bk-tree with a string distance metric such as levenshtein see also http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees.

edit: Inventing a distance metric is hard, but I just happen to know one that you could use called structural entropic distance which is based on the information difference. It works as follows:

take two strings x = "Elvis Pr" and y = "Elvis Aaron Presley"

for each work out the multiset of uni- and bi-grams:

x = {e, l, v, i, s, _, p, r, el, lv, vi, is, s_,  _p, pr}
y = {ex3, lx2, v, i, sx2, _x2, ax2, rx2, o, n, p, y, el, lv, vi, is, s_, _a, aa, ar, ro, on, n_, _p, pr, re, es, sl, le, ey}

now for those terms that are in both

{e, l, v, i, s, _, p, r, el, lv, vi, is, s_, _p, pr}

calculate the product (f_x(t) / (f_x(t) + f_y(t)))^{f_x(t)/2} * (f_y(t) / (f_x(t) + f_y(t)))^{f_y(t)/2} so

e  = ((1/15) / (1/15 + 3/37))^(1/30) * ((3/37) / (1/15 + 3/37))^(3/74)
l  = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
v  = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
i  = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s  = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
_  = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
p  = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
r  = ((1/15) / (1/15 + 2/37))^(1/30) * ((2/37) / (1/15 + 2/37))^(2/74)
el = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
lv = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
vi = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
is = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
s_ = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
_p = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)
pr = ((1/15) / (1/15 + 1/37))^(1/30) * ((1/37) / (1/15 + 1/37))^(1/74)

Multiply all these together and you should get a number in the range [0.5, 1] so you can scale this more usefully to the range [0,1] by multiplying by 2 and subtracting 1.

However this is not a discrete distance metric so you will have to use another metric index such as the vp-tree

Upvotes: 1

Algorithm for fast substring search in a contact database

Answers (4)

Related Questions