Reputation: 10709
I have two arrays, S & T, containing words (lowercased, trimmed, without diacritics). The number of words can differ. (Most of the data consists of proper names, rather short (< 5).)
I need to find a good metric (and its implementation, or maybe even a research paper) that lets me calculate the level of similarity between those arrays.
Some ideas I have so far:
Any other ideas?
Upvotes: 1
Views: 229
Reputation: 1101
To me, this looks like modeling the documents with the bag-of-words model: http://en.wikipedia.org/wiki/Bag-of-words_model
Depending on your application, you can use different criteria for comparing two bag-of-words feature vectors, along the lines of what you described. In addition, there are models based on learning statistical relationships between words/sentences, such as topic models: http://en.wikipedia.org/wiki/Topic_model
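For illustration, here is a minimal sketch (my own, not from the answer) of treating the two word arrays as bag-of-words vectors and comparing them with cosine similarity; the example words are made up:

```python
from collections import Counter
from math import sqrt

def cosine_bow(s, t):
    # Count word occurrences in each array to form bag-of-words vectors.
    cs, ct = Counter(s), Counter(t)
    # Dot product over the words the two bags share.
    dot = sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())
    norm = sqrt(sum(c * c for c in cs.values())) * sqrt(sum(c * c for c in ct.values()))
    return dot / norm if norm else 0.0

print(cosine_bow(["anna", "maria", "nowak"], ["maria", "nowak", "jan"]))  # ~0.67
```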
Upvotes: 1
Reputation: 122
If the strings are Western names, Soundex might be a starting point.
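As a rough illustration, here is a compact Soundex sketch in Python (the sample names are arbitrary; libraries such as jellyfish also ship a Soundex implementation):

```python
def soundex(word):
    # Standard American Soundex: keep the first letter, encode the rest
    # as digits, collapse repeated codes, pad/truncate to 4 characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("robert"), soundex("rupert"))  # both R163
```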
Upvotes: 0
Reputation: 1669
If the arrays are rather short, you can find the optimal pairing of the words given some rubric of word similarity. On top of that, you can layer a score for how much the arrays have to be reordered for the optimal pairs to line up, e.g. as a multiplier or some other penalty. A sketch of the pairing step follows.
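This is a minimal sketch of the pairing idea (not the answer's exact method), assuming difflib.SequenceMatcher.ratio as a stand-in word-similarity rubric and brute force over permutations, which is fine for short arrays:

```python
from difflib import SequenceMatcher
from itertools import permutations

def word_sim(a, b):
    # Similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

def best_pairing_score(s, t):
    # Make s the shorter array so each of its words is paired
    # with a distinct word from t.
    if len(s) > len(t):
        s, t = t, s
    best = 0.0
    for perm in permutations(t, len(s)):
        score = sum(word_sim(a, b) for a, b in zip(s, perm))
        best = max(best, score)
    # Normalize by the number of pairs so the result stays in [0, 1].
    return best / len(s) if s else 1.0

print(best_pairing_score(["jan", "novak"], ["novak", "jana", "petr"]))
```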
One word-similarity metric we recently learned about in Natural Language Processing is Levenshtein distance. There are more complex variants such as the Smith-Waterman algorithm (it's linked on the wiki page). These algorithms measure orthographic similarity, so they are used in morphological analysis to give an idea of how similar words look. Because Smith-Waterman is a local alignment, it scores one word as extremely similar to another if it is contained within it, no matter how long the surrounding prefix/suffix is.
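For reference, a straightforward dynamic-programming Levenshtein distance in Python (a sketch, not tied to any particular library):

```python
def levenshtein(a, b):
    # Edit distance: minimum number of insertions, deletions and
    # substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```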
Upvotes: 0