Reputation: 10709
I have two arrays, S & T, containing words (lowercased, trimmed, without diacritics). The number of words can differ. (Most of the data consists of proper names, rather short (< 5).)
I need to find a good metric (and its implementation, or maybe even a research paper) that lets me calculate the level of similarity between those arrays.
Some ideas I have so far:
Any other ideas?
Upvotes: 1
Views: 229
Reputation: 1101
To me, this looks like modeling the documents with the bag-of-words model: http://en.wikipedia.org/wiki/Bag-of-words_model
Depending on your application, you can use different criteria for comparing two bag-of-words feature vectors, along the lines of what you described. In addition, there are models based on learning statistical relationships between words/sentences, such as topic models: http://en.wikipedia.org/wiki/Topic_model
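For illustration, here is a minimal sketch (my own, not from the answer) of treating the two word arrays as bag-of-words vectors and comparing them with cosine similarity; the example words are made up:

```python
from collections import Counter
from math import sqrt

def cosine_bow(s, t):
    # Count word occurrences in each array to form bag-of-words vectors.
    cs, ct = Counter(s), Counter(t)
    # Dot product over the words the two bags share.
    dot = sum(cs[w] * ct[w] for w in cs.keys() & ct.keys())
    norm = sqrt(sum(c * c for c in cs.values())) * sqrt(sum(c * c for c in ct.values()))
    return dot / norm if norm else 0.0

print(cosine_bow(["anna", "maria", "nowak"], ["maria", "nowak", "jan"]))  # ~0.67
```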
Upvotes: 1
Reputation: 122
If the strings are Western names, Soundex might be a starting point.
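As a rough illustration, here is a compact Soundex sketch in Python (the sample names are arbitrary; libraries such as jellyfish also ship a Soundex implementation):

```python
def soundex(word):
    # Standard American Soundex: keep the first letter, encode the rest
    # as digits, collapse repeated codes, pad/truncate to 4 characters.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    if not word:
        return ""
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # h and w do not reset the previous code
            prev = code
    return (result + "000")[:4]

print(soundex("robert"), soundex("rupert"))  # both R163
```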
Upvotes: 0
Reputation: 1669
If the arrays are rather short, you can find the optimal pairing of the words given some rubric of word similarity. On top of that, you can layer a score for how much the arrays have to be reordered for the optimal pairs to line up, e.g. as a multiplier or some other penalty. A sketch of the pairing step follows.
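This is a minimal sketch of the pairing idea (not the answer's exact method), assuming difflib.SequenceMatcher.ratio as a stand-in word-similarity rubric and brute force over permutations, which is fine for short arrays:

```python
from difflib import SequenceMatcher
from itertools import permutations

def word_sim(a, b):
    # Similarity in [0, 1]; 1.0 means identical strings.
    return SequenceMatcher(None, a, b).ratio()

def best_pairing_score(s, t):
    # Make s the shorter array so each of its words is paired
    # with a distinct word from t.
    if len(s) > len(t):
        s, t = t, s
    best = 0.0
    for perm in permutations(t, len(s)):
        score = sum(word_sim(a, b) for a, b in zip(s, perm))
        best = max(best, score)
    # Normalize by the number of pairs so the result stays in [0, 1].
    return best / len(s) if s else 1.0

print(best_pairing_score(["jan", "novak"], ["novak", "jana", "petr"]))
```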
One word-similarity metric we recently learned about in Natural Language Processing is Levenshtein distance. There are more complex variants such as the Smith-Waterman algorithm (it's linked on the wiki page). These algorithms measure orthographic similarity, so they are used in morphological analysis to give an idea of how similar words look. Because Smith-Waterman is a local alignment, it scores one word as extremely similar to another if it is contained within it, no matter how long the surrounding prefix/suffix is.
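For reference, a straightforward dynamic-programming Levenshtein distance in Python (a sketch, not tied to any particular library):

```python
def levenshtein(a, b):
    # Edit distance: minimum number of insertions, deletions and
    # substitutions needed to turn a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```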
Upvotes: 0