Ivan
Ivan

Reputation: 39

Comparing two strings using known algorithms

I'm trying to compare two strings (product names) using some of well known algorithms like Levenstein distance and library of different solutions for string simmetrics (got best results with SmithWatermanGotoh alg).

Two strings are:

iPhone 3gs 32 GB black

Apple iPhone 3 gs 16GB black

Levenstein is working pretty bad on whole string if some words are in different order (which is expected from how algorithm works) so I tried to implement word by word comparison.

The problem I'm facing with is the way to detect similar 'words' that are divided with space char ('3gs'->'3 gs' ; '32 GB'->'16GB').

My code compares shorter (word count, if == then str.length) string with longer one. Words are split into ArrayList<String>. I'm combining each word from str1 with others in the same string creating new arraylist.

Here is a rough code:

foreach(str1)

    foreach(str2)
        res1 = getLevensteinDist
    endforeach

    foreach(combinedstr2)
        res1 = getLevensteinDist
    endforeach      

    return getHigherPercent(res1, res2)

 endforeach

This works if the words in str2 are split, but I can't figure out how to do a reverse, detect words in str2 that are split in str1.

I hope I'm at least a bit clear what I'm trying to do. Every help is appreciated.

Upvotes: 0

Views: 297

Answers (4)

olegarch
olegarch

Reputation: 3891

13 years ago I wrote my own implementation of trigram fuzzy search algorithm, named "Wilbur-Khovayko algorithm".

You can download here: http://olegh.cc.st/wilbur-khovayko.tar.gz

It search "N closest terms" for entered search term.

List of terms - in the file termlist.txt N - in the variable lim, file findtest.c

Alrorithm very quick: on the old Sun 200mHz, it search 100 closest term among 100,000 entries for ~0.3 secs.

Upvotes: 0

usamec
usamec

Reputation: 2394

Try to split one of the string into words and then for eash word run SmithWaterman and use scores from SmithWaterman as similarity measure.

Upvotes: 0

dopplesoldner
dopplesoldner

Reputation: 9479

Have a look at TF-IDF. It is specifically designed to compute similarities between textual features.

http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

Upvotes: 0

Saeed Amiri
Saeed Amiri

Reputation: 22565

First of all you should preprocess your strings, I mean you should remove "a, the, as, an" and all common verbs, numnbers,... from input strings, also you should convert every plural form to the singular form, .... to unify all words. Then you can apply some string matching algorithms, or just put the words into the hashmap, or if they are a lot, put them into the trie, and run your similarity algorithm.

Upvotes: 1

Related Questions