ts.

Reputation: 10709

Good metric for distance between arrays of strings

I have two arrays, S & T, containing words (lowercased, trimmed, without diacritics). The number of words can differ. (Most of the data is a kind of proper names, rather short (<5).)

I need to find a good metric (and its implementation, or maybe even a research paper) that allows me to calculate the level of similarity of those arrays.

Some ideas I have so far:

Any other ideas?

Upvotes: 1

Views: 229

Answers (3)

iampat

Reputation: 1101

To me, this looks like modeling the documents with the bag-of-words model: http://en.wikipedia.org/wiki/Bag-of-words_model

Depending on your application, you can use different criteria for comparing two bag-of-words feature vectors, like the ones you mention in your question. In addition, there are models based on statistically learning relationships between different words/sentences, such as topic models: http://en.wikipedia.org/wiki/Topic_model
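A minimal sketch of that comparison, assuming cosine similarity as the criterion (the function name `cosine_bow` is my own; Jaccard or plain overlap would slot in the same way):

```python
from collections import Counter
import math

def cosine_bow(S, T):
    """Cosine similarity between the bag-of-words vectors of two word arrays.
    Returns a value in [0, 1]; 1.0 means identical word counts."""
    s, t = Counter(S), Counter(T)
    # The dot product only needs the words the two bags share.
    dot = sum(s[w] * t[w] for w in s.keys() & t.keys())
    norm = math.sqrt(sum(c * c for c in s.values()))
    norm *= math.sqrt(sum(c * c for c in t.values()))
    return dot / norm if norm else 0.0

print(cosine_bow(["anna", "maria", "smith"], ["maria", "smith"]))  # ~0.816
```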

Upvotes: 1

WaywiserTundish

Reputation: 122

If the strings are Western names, Soundex might be a starting point.
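A sketch of the classic four-character Soundex code, in case it helps as a starting point (this is the simplified variant that treats h/w like vowels, so a few archival edge cases such as "Ashcraft" code slightly differently):

```python
def soundex(word: str) -> str:
    """Simplified four-character Soundex code (e.g. 'Robert' -> 'R163')."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    word = word.lower()
    digits = [codes.get(ch, "0") for ch in word]  # "0" marks vowels/h/w/y
    # Collapse runs of identical digits, then drop the vowel markers.
    collapsed = [digits[0]]
    for d in digits[1:]:
        if d != collapsed[-1]:
            collapsed.append(d)
    tail = "".join(d for d in collapsed[1:] if d != "0")
    return (word[0].upper() + tail + "000")[:4]

print(soundex("Smith"), soundex("Smythe"))  # S530 S530 -- same code
```

Names that sound alike map to the same code, so code equality can serve as a cheap word-similarity rubric before trying anything fancier.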

Upvotes: 0

emschorsch

Reputation: 1669

If the arrays are rather short, you can find the optimal pairing of the words given some rubric of word similarity. Then layer some scoring on top for how far the strings have to be rotated/reordered for the optimal pairings to line up; this could be a multiplier or some other scheme.

One metric of word similarity that we recently learned about in Natural Language Processing is Levenshtein distance. There are other, more complex variants such as the Smith-Waterman algorithm (it's linked on the wiki page). These algorithms are meant to measure orthographic similarity, so they are used in morphological analysis to give an idea of how similar words are based on appearance. Under the Smith-Waterman algorithm, if one word is contained within the other, the two count as extremely similar no matter how long the suffix/prefix is.
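A sketch of that pairing idea, assuming Levenshtein distance as the word rubric and SciPy's `linear_sum_assignment` (the Hungarian algorithm) to find the optimal pairing; the per-character penalty for unmatched words is a made-up placeholder, not something from the answer above:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def array_distance(S, T):
    """Total edit distance under the optimal word pairing, plus a
    placeholder penalty of one edit per character of each unmatched word."""
    cost = np.array([[levenshtein(s, t) for t in T] for s in S])
    rows, cols = linear_sum_assignment(cost)  # optimal pairing
    longer, matched = (S, set(rows)) if len(S) > len(T) else (T, set(cols))
    penalty = sum(len(longer[i]) for i in range(len(longer)) if i not in matched)
    return int(cost[rows, cols].sum()) + penalty

print(array_distance(["jon", "smith"], ["john", "smith", "jr"]))  # 1 + 0 + 2 = 3
```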

Upvotes: 0
