Arli94

Reputation: 710

Fuzzy matching inside a column

Suppose I have a list of sports like this:

sports=["futball","fitbal","football","tennis","tenis","tenisse","footbal","zennis","ping-pong"]

I would like to create a dataframe that matches each element of sports with its closest match if the fuzzy-matching ratio is greater than 0.5, and otherwise matches it with itself. (I want to use the function fuzzywuzzy.fuzz.ratio(x,y) for that.)

The result should look like this:

pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})

    sport      closest_match
0   futball    futball
1   fitbal     futball
2   football   football
3   tennis     tennis
4   tenis      tennis
5   tenisse    tennis
6   footbal    futball
7   zennis     tennis
8   ping-pong  ping-pong

Thanks

Upvotes: 0

Views: 187

Answers (1)

Frenchy

Reputation: 16997

Here is a solution using itertools.combinations:

import itertools

from fuzzywuzzy import fuzz
import pandas as pd

sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]

# Score every unordered pair once; fuzz.ratio returns an integer from 0 to 100
pairs = [(a, b, fuzz.ratio(a, b)) for a, b in itertools.combinations(sports, 2)]
# Keep only the pairs whose score exceeds 50
df = pd.DataFrame([p for p in pairs if p[2] > 50], columns=["sport", "closest", "ratio"])

# Note: 'max' is applied to each column independently, so 'closest' is the
# alphabetically last candidate, not necessarily the one with the highest ratio
df = df.groupby(['sport'])[['closest','ratio']].agg('max').reset_index()
print(df)

output:

      sport   closest  ratio
0    fitbal  football     77
1  football   footbal     93
2   futball  football     80
3     tenis    zennis     83
4   tenisse    zennis     62
5    tennis    zennis     91
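
The table above leaves out entries that never appear first in a pair (such as ping-pong), and the question asked for every sport to appear, falling back to itself when nothing scores above the threshold. A minimal sketch of that rule follows. It is not the approach used above: the helper names similarity and closest_matches are hypothetical, and difflib.SequenceMatcher from the standard library is used as a stand-in scorer on a 0 to 1 scale, on the assumption that you would pass in a wrapper around fuzz.ratio(a, b) / 100 instead.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in scorer on a 0-1 scale (assumption: replace with fuzz.ratio(a, b) / 100)."""
    return SequenceMatcher(None, a, b).ratio()


def closest_matches(items, threshold=0.5, scorer=similarity):
    """Pair each item with its best-scoring other item, or with itself
    when no other item scores above the threshold."""
    rows = []
    for i, item in enumerate(items):
        # Candidates are all entries except the one at the same position
        others = [other for j, other in enumerate(items) if j != i]
        best = max(others, key=lambda other: scorer(item, other), default=item)
        rows.append((item, best if scorer(item, best) > threshold else item))
    return rows


sports = ["futball", "fitbal", "football", "tennis", "tenis",
          "tenisse", "footbal", "zennis", "ping-pong"]
rows = closest_matches(sports)
for sport, match in rows:
    print(sport, "->", match)
```

Passing rows to pd.DataFrame(rows, columns=["sport", "closest_match"]) gives one row per sport. Because each entry is paired with its nearest other entry, this will not reproduce the table in the question exactly: there, some entries (futball, for example) are mapped to themselves even though a close neighbour exists, which would need an extra rule for choosing a representative spelling per group.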

Upvotes: 1
