Sam
Sam

Reputation: 47

Fuzzy Matching Two Columns in the Same Dataframe Using Python

I have two datasets within the same data frame each showing a list of companies. One dataset is from 2017 and the other is from this year. I am trying to match the two company datasets to each other and figured fuzzy matching ( FuzzyWuzzy) was the best way to do this. Using a partial ratio, I want to simply have the columns with the values listed as so: last year company's name, highest fuzzy matching ratio, this year company associated with that highest score. The original data frame has been given the variable "data" with last year company names under the column "Company" and this year company names under the column "Company name". To accomplish this task, I tried to create a function with the extractOne fuzzy matching process and then apply that function to each value/row in the dataframe. I would then add the results to my original data frame.

Here is the code below:

names_array=[]
ratio_array=[]
def match_names(last_year,this_year):
    for row in last_year:
    x=process.extractOne(row,this_year)
    names_array.append(x[0])
    ratio_array.append(x[1])
return names_array,ratio_array


#last year company names dataset
last_year=data['Company'].dropna().values

#this year companydataset

this_year=data['Company name'].values

name_match,ratio_match=match_names(last_year,this_year)

data['this_year']=pd.Series(name_match)
data['match_rating']=pd.Series(ratio_match)

data.to_csv("test.csv")

However, every time I execute this part of the code, the two added columns I created, do not show up in the csv. In fact, "test.csv" is just the same data frame as before despite the computer showing it as recently created. If anyone could point out the problem or help me out in any way, it would truly be appreciated.

Edit ( data frame preview):

          Company                Company name
0                   BODYPHLO  SPORTIQUE                         NaN
1                        JOSEPH A PERRY                         NaN
2                PCH RESORT TENNIS SHOP                         NaN
3              GREYSTONE GOLF CLUB INC.                         NaN
4                 MUSGROVE COUNTRY CLUB                         NaN
5           CITY OF PELHAM RACQUET CLUB                         NaN
6                 NORTHRIVER YACHT CLUB                         NaN
7                           LAKE FOREST                         NaN
8                   TNL TENNIS PRO SHOP                         NaN
9                SOUTHERN ATHLETIC CLUB                         NaN
10           ORANGE BEACH TENNIS CENTER                         NaN

Then after the Company entries (last year company data sets) end, the "Company name" column ( this year company data sets) begins as so:

4168                                NaN                LEWIS TENNIS
4169                                NaN          CHUCKS PRO SHOP AT
4170                                NaN                CHUCK KINYON
4171                                NaN   LAKE COUNTRY RACQUET CLUB
4172                                NaN   SPORTS ACADEMY & RAC CLUB

Upvotes: 0

Views: 5839

Answers (1)

rahlf23
rahlf23

Reputation: 9019

Your dataframe structure is odd considering that one column only begins once the other end, however we can make it work. Let's take the following sample dataframe for data that you supplied:

                        Company               Company name
0           BODYPHLO  SPORTIQUE                        NaN
1                JOSEPH A PERRY                        NaN
2        PCH RESORT TENNIS SHOP                        NaN
3      GREYSTONE GOLF CLUB INC.                        NaN
4         MUSGROVE COUNTRY CLUB                        NaN
5   CITY OF PELHAM RACQUET CLUB                        NaN
6         NORTHRIVER YACHT CLUB                        NaN
7                   LAKE FOREST                        NaN
8           TNL TENNIS PRO SHOP                        NaN
9        SOUTHERN ATHLETIC CLUB                        NaN
10   ORANGE BEACH TENNIS CENTER                        NaN
11                          NaN               LEWIS TENNIS
12                          NaN         CHUCKS PRO SHOP AT
13                          NaN               CHUCK KINYON
14                          NaN  LAKE COUNTRY RACQUET CLUB
15                          NaN  SPORTS ACADEMY & RAC CLUB

Then perform your matching:

import pandas as pd
from fuzzywuzzy import process, fuzz

known_list = data['Company name'].dropna()

def find_match(x):

    match = process.extractOne(x['Company'], known_list, scorer=fuzz.partial_token_sort_ratio)
    return pd.Series([match[0], match[1]])

data[['this year','match_rating']] = data.dropna(subset=['Company']).apply(find_match, axis=1, result_type='expand')

Yields:

                        Company Company name                  this year  \
0           BODYPHLO  SPORTIQUE          NaN  SPORTS ACADEMY & RAC CLUB   
1                JOSEPH A PERRY          NaN         CHUCKS PRO SHOP AT   
2        PCH RESORT TENNIS SHOP          NaN               LEWIS TENNIS   
3      GREYSTONE GOLF CLUB INC.          NaN  LAKE COUNTRY RACQUET CLUB   
4         MUSGROVE COUNTRY CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
5   CITY OF PELHAM RACQUET CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
6         NORTHRIVER YACHT CLUB          NaN  LAKE COUNTRY RACQUET CLUB   
7                   LAKE FOREST          NaN  LAKE COUNTRY RACQUET CLUB   
8           TNL TENNIS PRO SHOP          NaN               LEWIS TENNIS   
9        SOUTHERN ATHLETIC CLUB          NaN  SPORTS ACADEMY & RAC CLUB   
10   ORANGE BEACH TENNIS CENTER          NaN               LEWIS TENNIS   

    match_rating  
0           47.0  
1           43.0  
2           67.0  
3           43.0  
4           67.0  
5           72.0  
6           48.0  
7           64.0  
8           67.0  
9           50.0  
10          67.0 

Upvotes: 2

Related Questions