Reputation: 69
I am getting a TypeError while doing fuzzy matching between two columns in two different dataframes. I have already handled the NaNs and converted the data type to string, but it still fails. I also can't figure out which value is causing the error. I tried matching the values one by one in a for loop, and that way the code never fails, but I don't want to use a for loop for this.
The error message is: TypeError: expected string or bytes-like object
The code is:
a = df1['ColAddress1'].dropna()
b = df2['ColAddress2'].dropna()
match = process.extractOne(a, b, scorer=fuzz.partial_token_sort_ratio)
I cannot share the data, but it contains four types of characters: letters [a-zA-Z], digits, dashes (-), and square brackets ([]).
Does anyone have an idea how I can resolve this?
Upvotes: 0
Views: 275
Reputation: 19307
Complete code for getting the best match between 2 lists/series of strings, using itertools for getting combinations of the a and b lists/series and np.argmax to get the index of the highest score:
import itertools
from fuzzywuzzy import fuzz
import numpy as np
a = ['hi','there']
b = ['hello','their']
scores = [fuzz.partial_token_sort_ratio(i, j) for i,j in itertools.product(a,b)]
list(itertools.product(a,b))[np.argmax(scores)]
('there', 'their')
The process.extractOne function expects a query and choices. It returns the best match for the query from the choices. Query is a string and Choices are the list/Series of strings you want to compare. Currently, you are passing it 2 Series. Instead, use a loop over one of the Series to get the best matches with the Choices from the other.
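For context on the error message itself: as far as I can tell, the query is cleaned with a regular expression before scoring, and Python's re module raises exactly this TypeError when it is handed a collection instead of a string. The sketch below only illustrates that mechanism with the standard library. The clean function is a hypothetical stand-in, not fuzzywuzzy's actual code, and the sample addresses are made up.

```python
import re


def clean(text):
    # Hypothetical stand-in for a string-cleaning step (an assumption,
    # not fuzzywuzzy's real implementation): keep letters and digits only.
    return re.sub(r"[^0-9a-zA-Z]+", " ", text).strip().lower()


# A single string is processed without problems.
single = clean("12-B [Main] St")
print(single)

# A collection passed where one string is expected triggers the TypeError.
message = ""
try:
    clean(["12-B Main St", "7 Elm Rd"])
except TypeError as exc:
    message = str(exc)
print(message)
```

This would also explain why matching one value at a time never fails: each call then receives a single string as the query, as in the following code.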
from fuzzywuzzy import fuzz, process
a = ['hi','there']
b = ['hello','their']
match = [(i,*process.extractOne(i, b, scorer=fuzz.partial_token_sort_ratio)) for i in a]
match
[('hi', 'hello', 50), ('there', 'their', 80)] #query, bestchoice, score
If you want the max tuple from this list, just use -
import numpy as np
match[np.argmax([i[2] for i in match])]
('there', 'their', 80)
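The same query-versus-choices pattern can be sketched with the standard library only, which also makes it easy to add a guard that fails with a clearer message when the query is not a single string. Note the assumptions: similarity is a stand-in scorer built on difflib, not a fuzzywuzzy function, so its scores can differ from fuzz.partial_token_sort_ratio, and best_match is a hypothetical helper name. The built-in max with a key is used here in place of np.argmax.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in scorer (NOT fuzzywuzzy): 0-100 similarity based on difflib.
    return round(100 * SequenceMatcher(None, a, b).ratio())


def best_match(query, choices, scorer=similarity):
    # Hypothetical helper: one string query against many choices.
    if not isinstance(query, str):
        raise TypeError(f"query must be a str, got {type(query).__name__}")
    best = max(choices, key=lambda choice: scorer(query, choice))
    return query, best, scorer(query, best)


queries = ["hi", "there"]
choices = ["hello", "their"]

# One (query, best choice, score) tuple per query.
matches = [best_match(q, choices) for q in queries]

# Highest-scoring tuple overall, using max with a key instead of np.argmax.
overall = max(matches, key=lambda m: m[2])
print(overall)
```

With pandas Series, the values would be passed in the same way: one string at a time as the query and the other Series (for example converted to a list) as the choices.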
Upvotes: 1