Reputation: 710
Suppose I have a list of sports like this:
sports=["futball","fitbal","football","tennis","tenis","tenisse","footbal","zennis","ping-pong"]
I would like to create a DataFrame that matches each element of sports with its closest match if the fuzzy-matching score is above 0.5 (50 on the 0-100 scale that fuzz.ratio returns), and with itself otherwise. (I want to use the function fuzzywuzzy.fuzz.ratio(x, y) for that.)
The result should look like:
pd.DataFrame({"sport":sports,"closest_match":["futball","futball","football","tennis","tennis","tennis","futball","tennis","ping-pong"]})
       sport  closest_match
0    futball        futball
1     fitbal        futball
2   football       football
3     tennis         tennis
4      tenis         tennis
5    tenisse         tennis
6    footbal        futball
7     zennis         tennis
8  ping-pong      ping-pong
Thanks
Upvotes: 0
Views: 187
Reputation: 16997
Here is a solution using itertools.combinations:
import itertools

import pandas as pd
from fuzzywuzzy import fuzz

sports = ["futball", "fitbal", "football", "tennis", "tenis", "tenisse", "footbal", "zennis", "ping-pong"]

# Score every unordered pair once, then keep the pairs with a ratio above 50
pairs = [(a, b, fuzz.ratio(a, b)) for a, b in itertools.combinations(sports, 2)]
df = pd.DataFrame([p for p in pairs if p[2] > 50], columns=["sport", "closest", "ratio"])

# Note: agg('max') takes the maximum of each column independently, so the
# 'closest' and 'ratio' values in a row need not come from the same pair
df = df.groupby(['sport'])[['closest', 'ratio']].agg('max').reset_index()
print(df)
Output:
      sport   closest  ratio
0    fitbal  football     77
1  football   footbal     93
2   futball  football     80
3     tenis    zennis     83
4   tenisse    zennis     62
5    tennis    zennis     91
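The rule from the question (take the best-scoring other entry if it beats the threshold, otherwise fall back to the entry itself) could also be sketched as a plain loop. This is only a minimal sketch: closest_matches and difflib_ratio are hypothetical helper names, and difflib.SequenceMatcher from the standard library is used as a stand-in scorer so the snippet runs without extra packages. Scores from fuzz.ratio may differ slightly. Because the rule is applied literally, some pairings may differ from the table in the question, which appears to group variants under one representative spelling.

```python
from difflib import SequenceMatcher


def difflib_ratio(a, b):
    # Stand-in scorer on a 0-100 scale (an assumption for this sketch);
    # pass fuzz.ratio as `scorer` instead to use fuzzywuzzy
    return round(100 * SequenceMatcher(None, a, b).ratio())


def closest_matches(items, scorer=difflib_ratio, threshold=50):
    """Pair each item with its best-scoring other item, or with itself."""
    rows = []
    for i, item in enumerate(items):
        # Score the item against every other entry in the list
        scored = [(scorer(item, other), other)
                  for j, other in enumerate(items) if j != i]
        best_score, best = max(scored, key=lambda t: t[0], default=(0, item))
        # Fall back to the item itself when nothing clears the threshold
        rows.append((item, best if best_score > threshold else item))
    return rows


sports = ["futball", "fitbal", "football", "tennis", "tenis",
          "tenisse", "footbal", "zennis", "ping-pong"]
for sport, match in closest_matches(sports):
    print(sport, "->", match)
```

To get the DataFrame from the question, the returned rows can be wrapped with pd.DataFrame(closest_matches(sports, scorer=fuzz.ratio), columns=["sport", "closest_match"]).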
Upvotes: 1