Pandas Series and Nan Values for mismatched values

Question

I have these two dictionaries,

dico = {'Name': ['Arthur','Henri','Lisiane','Patrice','Zadig','Sacha'],
        "Age": ["20","18","62","73",'21','20'],
        "Studies": ['Economics','Maths','Psychology','Medical','Cinema','CS']
     
             }
dico2 = {'Surname': ['Arthur1','Henri2','Lisiane3','Patrice4']}

dico = pd.DataFrame.from_dict(dico)
dico2 = pd.DataFrame.from_dict(dico2)

in which I would like to match then append the Surname column with the Name column, to finally append it to dico, for a following output:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Nan       73     Medical
4    Zadig  Nan       21      Cinema
5    Sacha  Nan       20          CS

and ultimately delete the rows for which Surname is Nan

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology

map_list = []
for name in dico['Name']:
    best_ratio = None
    for idx, surname in enumerate(dico2['Surname']):
        if best_ratio == None:
            best_ratio = fuzz.ratio(name, surname)
            best_idx = 0
        else:
            ratio = fuzz.ratio(name, surname)
            if  ratio > best_ratio:
                best_ratio = ratio
                best_idx = idx
    map_list.append(dico2['Surname'][best_idx]) # obtain surname

dico['Surname'] = pd.Series(map_list) # add column
dico = dico[["Name", "Surname", "Age", "Studies"]] # reorder columns

#if the surname is not a great match, print "Nan"
dico = dico.drop(dico[dico.Surname == "NaN"].index)

but when I print(dico), the output is as follows:

      Name   Surname Age     Studies
0   Arthur   Arthur1  20   Economics
1    Henri    Henri2  18       Maths
2  Lisiane  Lisiane3  62  Psychology
3  Patrice  Patrice4  73     Medical
4    Zadig  Patrice4  21      Cinema
5    Sacha  Patrice4  20          CS

I don't see why after the Patrice row, there's a mismatch, while I want it to be "Nan".

anky · Accepted Answer

Lets try pd.Multiindex.from_product to create combinations and then assign a score with zip and fuzz.ratio and some filtering to create our dict, then we can use series.map and df.dropna:

from fuzzywuzzy import fuzz

comb = pd.MultiIndex.from_product((dico['Name'],dico2['Surname']))
scores = comb.map(lambda x: fuzz.ratio(*x)) #or fuzz.partial_ratio(*x)
d = dict(a for a,b in zip(comb,scores) if b>90) #change threshold
out = dico.assign(SurName=dico['Name'].map(d)).dropna(subset=['SurName'])

print(out)

      Name Age     Studies   SurName
0   Arthur  20   Economics   Arthur1
1    Henri  18       Maths    Henri2
2  Lisiane  62  Psychology  Lisiane3
3  Patrice  73     Medical  Patrice4

Pandas Series and Nan Values for mismatched values

Answers (2)

Related Questions