fellowCoder

Reputation: 69

TypeError while doing fuzzy matching

I am getting a TypeError while doing fuzzy matching between two columns in two different dataframes. I have already handled NaNs and converted the datatype to string, but it still fails. I also can't figure out which value is causing the error. I tried matching the values one by one in a for loop, and then the code never fails. However, I don't want to use a for loop for this.

The error message is: TypeError: expected string or bytes-like object

The code is:

a = df1['ColAddress1'].dropna()  
b = df2['ColAddress2'].dropna()
match = process.extractOne(a, b, scorer=fuzz.partial_token_sort_ratio)

I cannot share the data, but it contains 4 types of characters: letters [a-zA-Z], digits, dashes (-) and square brackets ([]).

Does anyone have an idea how I can resolve this?

Upvotes: 0

Views: 275

Answers (1)

Akshay Sehgal

Reputation: 19307

Better alternative to your goal

Complete code for getting the best match between 2 lists/Series of strings -

  1. Use itertools to get all combinations of the a and b lists/Series.
  2. Use the scorer from fuzz directly on each combination.
  3. Use np.argmax to get the index of the highest score.
  4. Fetch the tuple with the 2 strings that have the best match.

import itertools
from fuzzywuzzy import fuzz
import numpy as np

a = ['hi','there']  
b = ['hello','their']

# Score every (a, b) pair, then pick the pair with the highest score
scores = [fuzz.partial_token_sort_ratio(i, j) for i, j in itertools.product(a, b)]
list(itertools.product(a, b))[np.argmax(scores)]
# Output: ('there', 'their')
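
As a side note, the same four steps can be sketched without NumPy by letting max pick the best pair directly, so the product is only built once. To keep this runnable with the standard library alone, the scorer below (similarity) is a hypothetical stand-in built on difflib.SequenceMatcher, not a fuzzywuzzy function, so its scores will not match fuzz.partial_token_sort_ratio in general. Pass the fuzz scorer as the key instead if you have it installed.

```python
import itertools
from difflib import SequenceMatcher

def similarity(x, y):
    """Stand-in scorer (0-100) based on difflib; NOT a fuzzywuzzy function."""
    return round(100 * SequenceMatcher(None, x, y).ratio())

a = ['hi', 'there']
b = ['hello', 'their']

# Score each (a, b) pair once and keep the highest-scoring pair
best_pair = max(itertools.product(a, b), key=lambda pair: similarity(*pair))
print(best_pair)
```

For these toy lists the stand-in scorer also picks ('there', 'their').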

Addressing the issue

process.extractOne expects a query and a collection of choices. It returns the best match for the query from the choices.

The query is a single string, and the choices are the list/Series of strings you want to compare it against. Currently, you are passing a Series as the query as well, which is the likely cause of the TypeError. Instead, loop over one of the Series and get the best match for each value from the choices in the other.

from fuzzywuzzy import fuzz, process

a = ['hi','there']  
b = ['hello','their']
# One extractOne call per query string; each result is (query, best choice, score)
match = [(i, *process.extractOne(i, b, scorer=fuzz.partial_token_sort_ratio)) for i in a]
match
# Output: [('hi', 'hello', 50), ('there', 'their', 80)]

If you want the tuple with the highest score from this list, just use -

import numpy as np
match[np.argmax([i[2] for i in match])]  # index 2 of each tuple is the score
# Output: ('there', 'their', 80)
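
On why a whole Series as the query fails while one-by-one matching never does: the error message is the one Python's re module raises when it is handed something that is not a string. The sketch below reproduces that behaviour with the standard library only. It assumes (I have not confirmed this against the library source here) that the default preprocessing of the query runs a regular expression over it. The clean function is a hypothetical stand-in for such a step, not fuzzywuzzy's actual code.

```python
import re

# Hypothetical stand-in for a string-cleaning step; NOT fuzzywuzzy's real code
_non_word = re.compile(r"\W")

def clean(text):
    """Replace non-word characters with spaces, lowercase, and trim."""
    return _non_word.sub(" ", text).lower().strip()

queries = ["12-A [Main]", "7 Elm"]

# Passing the whole collection raises TypeError, much like passing a Series as the query
try:
    clean(queries)
    raised = False
except TypeError:
    raised = True

# Cleaning one string at a time works, which matches the one-by-one observation
cleaned = [clean(q) for q in queries]
print(raised, cleaned)
```

If this assumption holds, no individual value in your data is at fault; the type of the query argument is.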

Upvotes: 1
