abdullah temel

Reputation: 23

Word similarity between mail addresses and names

My problem is a little different from simple word similarity. Is there an algorithm for calculating the similarity between a mail address and a name?

    For example:
    mail [email protected]
    Name Abdullah temel
    Levenshtein, Hamming distance: 11
    Jaro distance: 0.52

These scores suggest the two strings are not very similar, yet this mail address most likely belongs to this name.

Upvotes: 2

Views: 70

Answers (2)

Naitik Chandak

Reputation: 130

Fuzzywuzzy can help with this. First remove the '@' and the domain name from the address using a regex. You will then have two strings, as follows:

from fuzzywuzzy import fuzz as fz
str1 = "Abd_tml_1132"
str2 = "Abdullah temel"

count_ratio = fz.ratio(str1,str2)
print(count_ratio)

Output -

46
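
The regex step is not shown above. A minimal sketch of it, assuming a hypothetical `example.com` domain (the real one is not given) and a helper name `local_part` chosen only for illustration:

```python
import re

def local_part(email):
    """Return the part of an email address before the first '@'."""
    # Drop the first '@' and everything after it (the domain).
    return re.sub(r"@.*$", "", email)

# Hypothetical address; only the local part comes from the example above.
str1 = local_part("Abd_tml_1132@example.com")
print(str1)
```

The result can then be passed to `fz.ratio` together with the name.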

Upvotes: 0

Rahul Agarwal

Reputation: 4100

There is no direct package for this, but the following approach can solve your problem.

Split the email ID into a list of words:

a = '[email protected]'
rest = a.split('@', 1)[0]  # keep only the part before '@'
result = ''.join(c for c in rest if not c.isdigit())  # drop digits, since names do not contain them
list_of_email_words = result.split('_')  # split into words; change the separator ('_', '.', ...) to suit the email ID
list_of_email_words = list(filter(None, list_of_email_words))  # remove empty strings

Split the name into a list of words:

b = 'Abdullah temel'
list_of_name_words = b.split(' ')

Apply fuzzy match to both lists:

from fuzzywuzzy import fuzz

# Compare every email word against every name word
score = []
for email_word in list_of_email_words:
    for name_word in list_of_name_words:
        score.append(fuzz.partial_ratio(email_word, name_word))

Now you just need to check whether any element of score exceeds a threshold of your choosing. For example:

threshold = 70
if any(x > threshold for x in score):
    print("matched")
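
The steps above can be combined into one function. The following is only a sketch under some assumptions: it swaps in the standard library's `difflib.SequenceMatcher` for `fuzz.partial_ratio` so that it runs without extra packages (the scores will not be identical to fuzzywuzzy's), it lowercases both sides, and it splits on any non-letter rather than only `_`. The function names and the `example.com` address are hypothetical.

```python
import re
from difflib import SequenceMatcher

def email_tokens(email):
    """Split the part before '@' into lowercase alphabetic tokens."""
    local = email.split("@", 1)[0]
    # Any run of non-letters (digits, '_', '.', '-') acts as a separator.
    return [t.lower() for t in re.split(r"[^A-Za-z]+", local) if t]

def best_token_score(email, name):
    """Best 0-100 similarity between any email token and any name word."""
    name_words = [w.lower() for w in name.split()]
    scores = [
        round(100 * SequenceMatcher(None, e, n).ratio())
        for e in email_tokens(email)
        for n in name_words
    ]
    # default=0 covers the case where either side has no usable tokens.
    return max(scores, default=0)

threshold = 70  # same illustrative threshold as above
email = "Abd_tml_1132@example.com"  # hypothetical domain
if best_token_score(email, "Abdullah temel") > threshold:
    print("matched")
```

Taking the maximum over all pairs is equivalent to the `any(x > threshold ...)` check; returning the score itself makes it easier to tune the threshold on real data.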

Upvotes: 1
