Gooner

Reputation: 1398

Approximate searching against a list of strings

So yes, I have read about how edit distance can be used to decide how "close" two strings are to each other. Implemented with dynamic programming, the algorithm takes O(mn) time, where m and n are the lengths of the text and pattern respectively. So if I have to match a string against roughly 5000 other strings, it's gonna take a LOT of time, which is simply not acceptable in my application. Is there a faster solution? I don't mind trading storage space for time.

I have seen an application called "Swype" on Android, which does something similar. It searches your query against its own database and suggests results. How does that work so quickly?

Note: Please don't suggest frameworks like Lucene, because I cannot run them on J2ME.

Upvotes: 6

Views: 1456

Answers (4)

Howard

Reputation: 39177

It really depends on the texts you are comparing. In the following I present two speed-up approaches within the original edit-distance framework.

We once had the same task, where we compared a short word sequence (something like 10-30 chars) against a dictionary of >300k short sentences (also 10-30 chars each). In this case the following approach saved us a lot of time:

  • sort the dictionary of target strings (this has to be done only once)
  • when you build the n*m table for string i, you can reuse the table from string i-1, since most of the rows are identical.

E.g. if you have the string "list of strings" followed by "list of words", you can reuse the first 8 rows of your table and only have to recalculate 5, because the two strings share the 8-character prefix "list of ". This way we saved up to 70-80% of the runtime with only small changes to our code.
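A rough sketch of this row-reuse idea in Java (the class and method names here are illustrative, not from the answer; the DP row for target prefix length i depends only on the first i target characters, which is what makes the reuse valid after sorting):

```java
import java.util.Arrays;

public class SortedDictMatcher {
    private final String[] dict;   // dictionary, sorted so neighbors share prefixes
    private int[][] rows;          // rows[i] = DP row after consuming i target chars

    public SortedDictMatcher(String[] dict) {
        this.dict = dict.clone();
        Arrays.sort(this.dict);    // done once, as the answer suggests
    }

    /** Returns the dictionary entry with the smallest edit distance to pattern. */
    public String closest(String pattern) {
        int m = pattern.length();
        int maxLen = 0;
        for (String t : dict) maxLen = Math.max(maxLen, t.length());
        rows = new int[maxLen + 1][m + 1];
        for (int j = 0; j <= m; j++) rows[0][j] = j;   // row 0: empty target prefix

        int best = Integer.MAX_VALUE;
        String bestStr = null;
        String prev = "";
        for (String t : dict) {
            // rows 1..common are unchanged from the previous target
            int common = commonPrefix(prev, t);
            for (int i = common + 1; i <= t.length(); i++) {
                rows[i][0] = i;
                char tc = t.charAt(i - 1);
                for (int j = 1; j <= m; j++) {
                    int sub = rows[i - 1][j - 1] + (tc == pattern.charAt(j - 1) ? 0 : 1);
                    rows[i][j] = Math.min(sub,
                            Math.min(rows[i - 1][j] + 1, rows[i][j - 1] + 1));
                }
            }
            int d = rows[t.length()][m];
            if (d < best) { best = d; bestStr = t; }
            prev = t;
        }
        return bestStr;
    }

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length()), i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

With a dictionary of long common prefixes (as in the "list of strings" / "list of words" example), most rows are skipped on each step.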

If instead you have a few long texts, the first approach won't save you much. But in that case you expect only a few entries to have a small edit distance, while all the others have a huge one. Since the n*m table is somewhat monotonic in each direction (i.e. the minimum of each row is non-decreasing, as is the minimum of each column), you can stop the calculation once you exceed a pre-specified threshold. You can even save the intermediate results and "restart" the calculation (with a higher bound) if you do not find a match within your initial threshold.
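The early-termination part of this could look as follows (a minimal sketch; the "restartable" variant would keep the last computed row around instead of discarding it):

```java
public class BoundedEditDistance {
    /**
     * Edit distance capped at limit: returns limit + 1 as soon as no cell in
     * the current row can lead to a result <= limit. Because the row minimum
     * never decreases from one row to the next, the early exit is safe.
     */
    static int boundedEditDistance(String a, String b, int limit) {
        int m = b.length();
        int[] prev = new int[m + 1], cur = new int[m + 1];
        for (int j = 0; j <= m; j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            int rowMin = cur[0];
            for (int j = 1; j <= m; j++) {
                int sub = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                cur[j] = Math.min(sub, Math.min(prev[j] + 1, cur[j - 1] + 1));
                rowMin = Math.min(rowMin, cur[j]);
            }
            if (rowMin > limit) return limit + 1;  // distance can only grow from here
            int[] tmp = prev; prev = cur; cur = tmp;
        }
        return prev[m];
    }
}
```

Screening 5000 candidates with a small limit this way abandons most of them after a handful of rows.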

Upvotes: 1

Mikola

Reputation: 9326

splix's answer is good. As another option (for very large string sets), you may want to consider using an n-gram representation:

http://en.wikipedia.org/wiki/N-gram

These are used for approximate pattern matching in a lot of database packages since they are fast and easy to implement using conventional indexing methodologies.
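A sketch of how such an n-gram index might be used as a pre-filter (the class, the trigram padding, and the minShared threshold are my own choices here, not a standard API): index each dictionary string by its trigrams once, then for a query only run the expensive edit-distance check on entries that share enough trigrams with it.

```java
import java.util.*;

public class TrigramIndex {
    // Maps each trigram to the dictionary entries containing it.
    private final Map<String, List<String>> index = new HashMap<String, List<String>>();

    public TrigramIndex(List<String> dict) {
        for (String s : dict)
            for (String g : trigrams(s)) {
                List<String> l = index.get(g);
                if (l == null) index.put(g, l = new ArrayList<String>());
                l.add(s);
            }
    }

    /**
     * Entries sharing at least minShared trigrams with the query;
     * only these survivors need an exact edit-distance check.
     */
    public Set<String> candidates(String query, int minShared) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String g : trigrams(query)) {
            List<String> hits = index.get(g);
            if (hits == null) continue;
            for (String s : hits) {
                Integer c = counts.get(s);
                counts.put(s, c == null ? 1 : c + 1);
            }
        }
        Set<String> out = new HashSet<String>();
        for (Map.Entry<String, Integer> e : counts.entrySet())
            if (e.getValue() >= minShared) out.add(e.getKey());
        return out;
    }

    private static Set<String> trigrams(String s) {
        Set<String> g = new HashSet<String>();
        String padded = "  " + s + "  ";   // pad so short words still yield trigrams
        for (int i = 0; i + 3 <= padded.length(); i++)
            g.add(padded.substring(i, i + 3));
        return g;
    }
}
```

The lookup cost depends on how many entries share trigrams with the query, not on the total dictionary size, which is what makes this attractive for large string sets.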

Upvotes: 1

Igor Artamonov

Reputation: 35961

We used http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm for nearly the same thing, and it worked fine for us.

There are a few Java implementations of it; you can find them on the web.
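For reference, a compact Aho-Corasick sketch (my own minimal version, using standard Java collections; a J2ME port would need the legacy Hashtable/Vector classes). Note that Aho-Corasick does exact multi-pattern matching, finding every dictionary word occurring in a text in one pass, rather than approximate matching:

```java
import java.util.*;

public class AhoCorasick {
    private static class Node {
        Map<Character, Node> next = new HashMap<Character, Node>();
        Node fail;
        List<String> out = new ArrayList<String>();
    }

    private final Node root = new Node();

    public AhoCorasick(String[] words) {
        for (String w : words) {                 // build the trie
            Node n = root;
            for (char c : w.toCharArray()) {
                Node child = n.next.get(c);
                if (child == null) n.next.put(c, child = new Node());
                n = child;
            }
            n.out.add(w);
        }
        Deque<Node> queue = new ArrayDeque<Node>();   // BFS failure links
        for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node n = queue.poll();
            for (Map.Entry<Character, Node> e : n.next.entrySet()) {
                Node f = n.fail;
                while (f != root && !f.next.containsKey(e.getKey())) f = f.fail;
                Node t = f.next.get(e.getKey());
                e.getValue().fail = (t != null) ? t : root;
                // a word ending at the failure state also ends here
                e.getValue().out.addAll(e.getValue().fail.out);
                queue.add(e.getValue());
            }
        }
    }

    /** All dictionary words occurring anywhere in text. */
    public Set<String> findAll(String text) {
        Set<String> found = new HashSet<String>();
        Node n = root;
        for (char c : text.toCharArray()) {
            while (n != root && !n.next.containsKey(c)) n = n.fail;
            Node t = n.next.get(c);
            n = (t != null) ? t : root;
            found.addAll(n.out);
        }
        return found;
    }
}
```

Search time is linear in the text length plus the number of matches, independent of how many dictionary words there are.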

PS: you can also look at other string-matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm

Upvotes: 1

flolo

Reputation: 15486

It's also a matter of how you define "close". If you don't insist on written similarity and phonetic similarity would also work, I'd suggest Soundex. It's a very fast algorithm for checking whether two words are phonetically close.
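A minimal American Soundex sketch (my own simplified version: it assumes alphabetic input and ignores the official rule that 'h'/'w' between two letters with the same code should not separate them). Two words count as phonetically close when their four-character codes match:

```java
public class Soundex {
    // Digit codes for letters A..Z (vowels, H, W, Y map to '0' = skipped).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String word) {
        String s = word.toUpperCase();   // assumes A-Z letters only
        StringBuilder out = new StringBuilder();
        out.append(s.charAt(0));         // the first letter is kept as-is
        char prev = CODES.charAt(s.charAt(0) - 'A');
        for (int i = 1; i < s.length() && out.length() < 4; i++) {
            char ch = s.charAt(i);
            if (ch < 'A' || ch > 'Z') continue;
            char code = CODES.charAt(ch - 'A');
            // skip vowels and collapse adjacent letters with the same code
            if (code != '0' && code != prev) out.append(code);
            prev = code;
        }
        while (out.length() < 4) out.append('0');   // pad to 4 characters
        return out.toString();
    }
}
```

Each word is encoded once in O(length), so comparing a query against 5000 precomputed codes is just 5000 string equality checks.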

Upvotes: 0

Related Questions