Reputation: 39
I'm trying to compare two strings (product names) using some of well known algorithms like Levenstein distance and library of different solutions for string simmetrics (got best results with SmithWatermanGotoh alg).
Two strings are:
iPhone 3gs 32 GB black
Apple iPhone 3 gs 16GB black
Levenstein is working pretty bad on whole string if some words are in different order (which is expected from how algorithm works) so I tried to implement word by word comparison.
The problem I'm facing with is the way to detect similar 'words' that are divided with space char ('3gs'->'3 gs' ; '32 GB'->'16GB').
My code compares shorter (word count, if == then str.length) string with longer one. Words are split into ArrayList<String>
. I'm combining each word from str1 with others in the same string creating new arraylist.
Here is a rough code:
foreach(str1)
foreach(str2)
res1 = getLevensteinDist
endforeach
foreach(combinedstr2)
res1 = getLevensteinDist
endforeach
return getHigherPercent(res1, res2)
endforeach
This works if the words in str2 are split, but I can't figure out how to do a reverse, detect words in str2 that are split in str1.
I hope I'm at least a bit clear what I'm trying to do. Every help is appreciated.
Upvotes: 0
Views: 297
Reputation: 3891
13 years ago I wrote my own implementation of trigram fuzzy search algorithm, named "Wilbur-Khovayko algorithm".
You can download here: http://olegh.cc.st/wilbur-khovayko.tar.gz
It search "N closest terms" for entered search term.
List of terms - in the file termlist.txt N - in the variable lim, file findtest.c
Alrorithm very quick: on the old Sun 200mHz, it search 100 closest term among 100,000 entries for ~0.3 secs.
Upvotes: 0
Reputation: 2394
Try to split one of the string into words and then for eash word run SmithWaterman and use scores from SmithWaterman as similarity measure.
Upvotes: 0
Reputation: 9479
Have a look at TF-IDF. It is specifically designed to compute similarities between textual features.
http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html
Upvotes: 0
Reputation: 22565
First of all you should preprocess your strings, I mean you should remove "a, the, as, an" and all common verbs, numnbers,... from input strings, also you should convert every plural form to the singular form, .... to unify all words. Then you can apply some string matching algorithms, or just put the words into the hashmap, or if they are a lot, put them into the trie, and run your similarity algorithm.
Upvotes: 1