data_the_goonie

Reputation: 65

Scoring word similarity between arbitrary text

I have a list of over 500 very important, but arbitrary, strings. They look like:

list_important_codes = ['xido9','uaid3','frps09','ggix21']

What I know

- Casing is not important, but all other characters must match exactly.
- Every string starts with 4 alphabetical characters and ends with either one or two numerical characters.
- I have a list of about 100,000 strings, list_recorded_codes, that were hand-typed and should match list_important_codes exactly, but about 10,000 of them don't. Because these strings were typed manually, the incorrect strings are usually only about 1 character off (errors such as an added space, two letters switched around, or "01" instead of "1").

What I need to do

I need to iterate through list_recorded_codes and find all of their perfect matches within list_important_codes.

What I tried

I spent about 10 hours trying to manually program a way to fix each word, but it proved impractical and incredibly tedious. Not to mention that when my list doubles in size at a later date, I would have to go through that whole process manually again.

The solution I think I need, and the expected output

I'm hoping that Python's NLTK can efficiently 'score' these arbitrary terms so I can pick a 'best score'. For example, if the word in question is inputword = "gdix88", and comparing it gives score(inputword, "gdox89") = .84 and score(inputword, "sudh88") = .21, then my expected output is highscore = .84 and highscoreword = 'gdox89'.

for manually_entered_text in ['xido9', 'uaid3', 'frp09', 'ggix21']:
    get_highest_score_from_important_words()  # returns word_with_highest_score
    manually_entered_text = word_with_highest_score

I am also willing to use a different set of tools to fix this issue if needed. But the simpler the better! Thank you!

Upvotes: 1

Views: 210

Answers (2)

Lydia van Dyke

Reputation: 2526

The 'score' you are looking for is called an edit distance. There is quite a lot of literature and there are many algorithms available - easy to find, but only once you know the proper term :)

See the corresponding Wikipedia article.

The nltk package provides an implementation of the so-called Levenshtein edit-distance:

from nltk.metrics.distance import edit_distance

if __name__ == '__main__':
    print(edit_distance("xido9", "xido9 "))
    print(edit_distance("xido9", "xido8"))
    print(edit_distance("xido9", "xido9xxx"))
    print(edit_distance("xido9", "xido9"))

The results are 1, 1, 3 and 0 in this case.

Here is the documentation of the corresponding nltk module.

There are more specialized versions of this score that take into account how frequent various typing errors are (for example, 'e' instead of 'r' might occur quite often because the keys are next to each other on a QWERTY keyboard).
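nltk's edit_distance already offers a small step in that direction: an optional transpositions flag that counts two swapped adjacent letters as a single edit, which is one of the error types listed in the question. A quick illustration:

from nltk.metrics.distance import edit_distance

# Plain Levenshtein counts an adjacent swap as two substitutions...
print(edit_distance("xido9", "xdio9"))                       # 2
# ...while transpositions=True counts it as a single edit.
print(edit_distance("xido9", "xdio9", transpositions=True))  # 1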

But classic Levenshtein is where I would start.
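Applied to the question's setup, a minimal sketch could look like this (the list contents are the question's own examples, the lowercasing reflects the "casing is not important" rule, and the normalized score is just one common way to turn a distance into the 0-to-1 similarity the question sketches):

from nltk.metrics.distance import edit_distance

list_important_codes = ['xido9', 'uaid3', 'frps09', 'ggix21']
list_recorded_codes = ['xido9 ', 'uiad3', 'frp09', 'GGIX21']  # hand-typed, with errors

def score(a, b):
    # Normalized similarity in [0, 1]: 1.0 for identical strings,
    # lower the more edits are needed. Casing is ignored.
    a, b = a.lower(), b.lower()
    return 1 - edit_distance(a, b) / max(len(a), len(b))

for recorded in list_recorded_codes:
    best = max(list_important_codes, key=lambda code: score(recorded, code))
    print(recorded, '->', best, round(score(recorded, best), 2))
    # e.g. 'xido9 ' maps back to 'xido9' with score 0.83

With 100,000 recorded codes against 500+ important ones, this brute-force loop does a lot of comparisons, so it is worth checking for an exact (lowercased) match first and only falling back to scoring for the roughly 10,000 misfits.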

Upvotes: 1

pakpe

Reputation: 5489

You could apply a dynamic programming approach to this problem. Once you have your scoring matrix, your alignment matrix, and your local and global alignment functions set up, you can iterate through list_important_codes and find the highest-scoring alignment in list_recorded_codes. Here is a project I did for DNA sequence alignment: DNA alignment. You can easily adapt it to your problem.
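For reference, here is a compressed sketch of the global-alignment (Needleman-Wunsch style) dynamic programming this answer refers to; the match/mismatch/gap values are illustrative assumptions, not taken from the linked project:

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    # matrix[i][j] holds the best score for aligning a[:i] with b[:j].
    rows, cols = len(a) + 1, len(b) + 1
    matrix = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        matrix[i][0] = i * gap   # leading gaps in b
    for j in range(1, cols):
        matrix[0][j] = j * gap   # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diag = matrix[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            matrix[i][j] = max(diag,                     # align a[i-1] with b[j-1]
                               matrix[i - 1][j] + gap,   # gap in b
                               matrix[i][j - 1] + gap)   # gap in a
    return matrix[-1][-1]

print(global_alignment_score("xido9", "xido9 "))  # 4: five matches, one gap
print(global_alignment_score("xido9", "sudh88"))  # much lower for a poor match

Unlike an edit distance, higher is better here, so you would take the max over list_important_codes rather than the min.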

Upvotes: 0
