Reputation: 1398
So yes, I read about how edit distance can be used to decide how "close" two strings are to each other. Implemented as a dynamic program, this algorithm takes O(mn) time, where m and n are the lengths of the text and pattern respectively. So if I have to match a string against roughly 5000 other strings, it's going to take a LOT of time, which is simply not acceptable for my application. Is there a faster solution that can be implemented? I don't mind trading storage space for time.
I have seen an application called "Swype" on Android, which does something similar. It searches your query against its own database and suggests results. How does that work so quickly?
Note: Please don't suggest frameworks like Lucene, because I cannot run them on J2ME.
Upvotes: 6
Views: 1456
Reputation: 39177
It really depends on the texts you are comparing. In the following I present two speed-up approaches within the original edit-distance framework.
We once had a similar task, where we compared a short word sequence (something like 10-30 chars) against a dictionary of >300k short sentences (also 10-30 chars each). In that case the following approach saved us a lot of time:
If your dictionary is sorted, then when you process entry i you can reuse the table from string i-1, since most of the lines are in common. E.g. if you have the two strings "list of strings" and next "list of words", you can reuse the first 8 lines of your table and only have to recalculate 5 (both strings have their first 8 characters in common). This way we saved up to 70-80% of the runtime with only small changes to our code.
If instead you have a few long texts, the first approach won't save you much. But in that case you expect that only a few entries have a small edit distance, while all the others have a huge one. Since the n*m table is somewhat monotonic in each direction (i.e. the minimum of each line never decreases, and the same holds for each column), you can stop the calculation once you reach a pre-specified threshold. You can even save the intermediate results and "restart" the calculation (with a higher bound) if you do not find a match within your initial threshold.
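A sketch of that early-termination idea (a hypothetical helper, not the poster's code): because the row minimum never decreases, once every entry of the current row exceeds the threshold the final distance cannot come back under it, so the comparison can be abandoned.

```java
// Sketch: thresholded edit distance with early exit.
public class BoundedEditDistance {

    /** Returns the edit distance, or threshold + 1 if it exceeds the threshold. */
    public static int distance(String a, String b, int threshold) {
        int n = b.length();
        int[] prev = new int[n + 1];
        int[] curr = new int[n + 1];
        for (int j = 0; j <= n; j++) prev[j] = j;

        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            int rowMin = curr[0];
            for (int j = 1; j <= n; j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(
                        prev[j] + 1,        // deletion
                        curr[j - 1] + 1),   // insertion
                        prev[j - 1] + cost); // substitution
                rowMin = Math.min(rowMin, curr[j]);
            }
            // The row minimum is non-decreasing, so no later row can dip
            // below it: safe to abandon this comparison now.
            if (rowMin > threshold) return threshold + 1;
            int[] t = prev; prev = curr; curr = t;
        }
        return prev[n];
    }
}
```

When scanning many candidates, you would call this with your current best distance as the threshold, so hopeless candidates are discarded after a few rows.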
Upvotes: 1
Reputation: 9326
splix's answer is good. As another option (for very large string sets), you may want to consider using an n-gram representation:
http://en.wikipedia.org/wiki/N-gram
These are used for approximate pattern matching in a lot of database packages since they are fast and easy to implement using conventional indexing methodologies.
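As a concrete illustration (my own sketch, not any particular database package's implementation), here is an n-gram similarity measure using the Dice coefficient over character bigrams:

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: n-gram similarity via the Dice coefficient over character bigrams.
// Unlike edit distance, the bigram sets can be precomputed and indexed,
// which is why this scales well to large string collections.
public class NGramSimilarity {

    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<String>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    /** Dice coefficient: 2*|A intersect B| / (|A| + |B|), in [0, 1]. */
    public static double dice(String a, String b) {
        Set<String> ga = bigrams(a), gb = bigrams(b);
        if (ga.isEmpty() && gb.isEmpty()) return 1.0;
        Set<String> common = new HashSet<String>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }
}
```

For example, "night" and "nacht" share only the bigram "ht" out of four bigrams each, giving a similarity of 0.25.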
Upvotes: 1
Reputation: 35961
We used http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm for nearly the same thing, and it worked fine for us.
There are a few Java implementations of it; you can find them on the web.
PS: you can also check other string-matching algorithms: http://en.wikipedia.org/wiki/String_searching_algorithm
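If you prefer something self-contained over hunting for a library, here is a compact Aho-Corasick sketch (my own minimal implementation, not one of the libraries mentioned above; note it does exact multi-pattern matching, not approximate matching):

```java
import java.util.*;

// Sketch: minimal Aho-Corasick. Builds a trie over the patterns, adds BFS
// failure links, then scans the text once, reporting every pattern found.
public class AhoCorasick {

    static class Node {
        Map<Character, Node> next = new HashMap<Character, Node>();
        Node fail;
        List<String> output = new ArrayList<String>();
    }

    private final Node root = new Node();

    public AhoCorasick(String[] patterns) {
        for (String p : patterns) {                 // 1. build the trie
            Node cur = root;
            for (char c : p.toCharArray()) {
                Node nxt = cur.next.get(c);
                if (nxt == null) { nxt = new Node(); cur.next.put(c, nxt); }
                cur = nxt;
            }
            cur.output.add(p);
        }
        root.fail = root;                           // 2. BFS failure links
        Deque<Node> queue = new ArrayDeque<Node>();
        for (Node child : root.next.values()) { child.fail = root; queue.add(child); }
        while (!queue.isEmpty()) {
            Node cur = queue.poll();
            for (Map.Entry<Character, Node> e : cur.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = cur.fail;
                while (f != root && !f.next.containsKey(c)) f = f.fail;
                Node target = f.next.get(c);
                child.fail = (target != null && target != child) ? target : root;
                child.output.addAll(child.fail.output); // inherit suffix matches
                queue.add(child);
            }
        }
    }

    /** Returns every pattern occurring in text, in order of match end. */
    public List<String> matches(String text) {
        List<String> found = new ArrayList<String>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            Node nxt = cur.next.get(c);
            cur = (nxt != null) ? nxt : root;
            found.addAll(cur.output);
        }
        return found;
    }
}
```

The scan is O(text length + number of matches), independent of the number of patterns, which is what makes it attractive for large dictionaries.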
Upvotes: 1
Reputation: 15486
It's also a matter of how you define "close". If you don't insist on written similarity and phonetic similarity would also work, I could suggest Soundex. It's a very fast algorithm for checking whether two words are phonetically close.
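A minimal Soundex sketch (a simplified variant of the classic American rules; it skips the special H/W adjacency handling, so codes may differ from other implementations for some words):

```java
// Sketch: simplified Soundex. Phonetically similar words map to the same
// 4-character code, so comparing words is a constant-time code comparison.
public class Soundex {

    // Digit for each letter a..z (0 = vowel-like, dropped after the first letter).
    private static final String CODES = "01230120022455012623010202";

    public static String encode(String word) {
        String s = word.toUpperCase();
        StringBuilder out = new StringBuilder();
        char prev = '0';
        for (int i = 0; i < s.length() && out.length() < 4; i++) {
            char c = s.charAt(i);
            if (c < 'A' || c > 'Z') continue;          // ignore non-letters
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c);                         // keep the first letter
            } else if (code != '0' && code != prev) {
                out.append(code);                      // skip vowels and repeats
            }
            prev = code;
        }
        while (out.length() < 4) out.append('0');      // pad to 4 characters
        return out.toString();
    }
}
```

For example, "Robert" and "Rupert" both encode to "R163", so a phonetic lookup would treat them as matches.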
Upvotes: 0