Reputation: 31
I am using Levenshtein distance which is a string metric for measuring the amount of difference between two sequences to find the percent of difference between two strings. I want to use a better method to declare the strings are similar using words in the strings.
For example: Lets say I have a string with 2 paragraphs and the second string only contains the second paragraph of the first string.
I know I could compare the first word of each strings and then the second etc but that wouldn't be effective if a case like the last example I presented happens.
I was thinking maybe comparing the first word in the first string with all of the words of the second string but I am afraid this is would make the process very slow.
Upvotes: 0
Views: 805
Reputation: 1599
Comparing each word in the first string with all of the words in the second string could yield slightly better performance than Levenshtein distance, but would be on the same order of magnitude. Levenstein distance is O(m*n) and your algorithm would be O(m^2) (where m and n are lengths of strings).
If you only care about matching words (e.g. "color" and "colour" would be treated as two completely different strings) and disregard the word order (e.g. "red color" and "color red" would be treated as two same strings) and you don't care about the space complexity of your algorithm, you could create an index of words of your first string (e.g. a hashtable) and then compare each word in the second string against this index. This yields an algorithm of complexity O(m+n) if for your index you use a data structure with constant-time insertions and deletions.
Upvotes: 1