Reputation: 4991
Is there any string distance algorithm that does not take the order of the words into account?
The following algorithms do not give the desired result (in this example the desired result should be 1):
import jaro
jaro.jaro_winkler_metric(u'Michael Jordan',u'Jordan Michael')
>>>0.47
import Levenshtein
Levenshtein.ratio('Michael Jordan', 'Jordan Michael')
>>>0.5
from difflib import SequenceMatcher
SequenceMatcher(None, 'Michael Jordan', 'Jordan Michael').ratio()
>>>0.5
One way to work around this is to sort the characters of each string alphabetically and then use one of the above algorithms:
''.join(sorted('Michael Jordan'))
>>>' JMaacdehilnor'
''.join(sorted('Jordan Michael'))
>>>' JMaacdehilnor'
But here the information about the name and surname is lost, so the results will not be 'stable'.
I have created a function, using permutations from itertools, that takes all the possible orderings of the words, compares the strings, and outputs the max value. The results are satisfactory, but the whole procedure is really slow when I have to compare millions of names.
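A minimal sketch of that approach (the function name and the use of SequenceMatcher as the underlying scorer are assumptions; any of the ratios above could be plugged in):

```python
from itertools import permutations
from difflib import SequenceMatcher

def best_permutation_ratio(name1, name2):
    """Compare name2 against every word ordering of name1 and keep the max score."""
    best = 0.0
    for perm in permutations(name1.split()):
        candidate = ' '.join(perm)
        best = max(best, SequenceMatcher(None, candidate, name2).ratio())
    return best

print(best_permutation_ratio('Michael Jordan', 'Jordan Michael'))  # 1.0
```

With n words this tries n! orderings, which is why it becomes slow over millions of comparisons.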
Something else that can be done is to sort the words such as:
' '.join(sorted('Michael Jordan'.split()))
>>>'Jordan Michael'
' '.join(sorted('Jordan Michael'.split()))
>>>'Jordan Michael'
This seems a nice and easy way to reduce the computation, but we lose some sensitive cases. Example:
name1 = ' '.join(sorted('Bizen Dim'.split()))
>>>'Bizen Dim'
name2 = ' '.join(sorted('Dim Mpizen'.split()))
>>>'Dim Mpizen'
SequenceMatcher(None, name1, name2).ratio()
>>> 0.55
These two names are the same, as there are cases where people 'translate' their names from 'b' to 'mp' (I am one of them). Using this approach we lose this 'match'.
Is there any string distance algorithm that compares the words and does not take the order of the words into consideration? Or is there a recommendation on how to implement the desired function efficiently?
Upvotes: 5
Views: 3941
Reputation: 9441
Try fuzzywuzzy.
install:
pip install fuzzywuzzy
pip install python-Levenshtein
Use it so that word order does not matter:
from fuzzywuzzy import fuzz
fuzz.token_sort_ratio(u'Michael Jordan',u'Jordan Michael')
>>>100
Upvotes: 4
Reputation: 1012
You can tokenize the two strings (say, with the NLTK tokenizer), compute the distance between every word pair and return the sum of all distances.
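A rough sketch of this idea (using str.split() instead of the NLTK tokenizer to stay dependency-free; pairing words by sorted position and averaging their similarities is my own assumption about how to combine the per-word scores):

```python
from difflib import SequenceMatcher

def word_pair_similarity(s1, s2):
    """Tokenize both strings, pair the words order-independently, and average
    the per-pair similarity scores."""
    words1 = sorted(s1.lower().split())
    words2 = sorted(s2.lower().split())
    # Pair words by sorted position so word order in the input is irrelevant.
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in zip(words1, words2)]
    return sum(scores) / len(scores) if scores else 0.0

print(word_pair_similarity('Michael Jordan', 'Jordan Michael'))  # 1.0
```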
Upvotes: 0
Reputation: 861
Try casting to lowercase, then sorting. The problem with sorting the original string is that Python sees capitals as higher in the sort order. (If you're going for Levenshtein distance, the spaces shouldn't be an issue.)
>>> ''.join(sorted('Michael Jordan'.lower()))
' aacdehijlmnor'
Then use the .index() method to get substring positions. (You can also use this answer, which uses the re module and is much more versatile.)
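Putting the lowercase-then-sort idea together with one of the ratios from the question might look like this (a sketch; the helper name is hypothetical):

```python
from difflib import SequenceMatcher

def sorted_char_ratio(s1, s2):
    """Lowercase each string, sort its characters, then compare the results."""
    a = ''.join(sorted(s1.lower()))
    b = ''.join(sorted(s2.lower()))
    return SequenceMatcher(None, a, b).ratio()

print(sorted_char_ratio('Michael Jordan', 'Jordan Michael'))  # 1.0
```

Note this still loses the word boundaries, as the question points out, so anagram-like false positives are possible.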
Upvotes: 0