Fuzzy Matching 2 DataFrames on multiple columns which includes one column with Float Values

Question

I have 2 DataFrames namely 'Master_data_df' & 'My_records_df'. I am required to find out records which are missed out from 'Master_data_df' by comparing with 'My_records_df'.

Here the column 'Cleint_Name' is a string and there is no exact match in 2 dataframes. Similarly, the column 'Transaction_Value' is a float and again the values varies slightly.

Master_data_df = pd.DataFrame({'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell' , 
                'China State Construction Engineering', 'CHINA STATE CONSTRUCTION 
                  ENGINEERING'],
                  'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]})

My_records_df = pd.DataFrame({'Client': ['Rayal Duch Shel', 'China National Petrolium', 
                'Arcellor Mittal' , 'China State Constrn Engg'],
                 'Value': [23455.98, 98426.32 , 45393.62, 35672.15})

I am looking for output as below. The Row with NaN value gives me records which are missing in My_records_df.

I have tried below:

import pandas as pd
import fuzzy_pandas as fpd
Final_Report_DF = fpd.fuzzy_merge(RMaster_data_df , My_records_df,
                        left_on= ['Client_Name' , 'Transaction_Value'],
                        right_on= ['Client' , 'Value'],
                        method= 'levenshtein',
                        threshold= 0.8)

This gives me a blank dataframe. (Note: fpd.fuzzy_merge does not allow me to use option 'join')

Request guidance in solving this.

Azhar Khan · Accepted Answer

You have to fix three things in above example:

Convert float to string (float comparison is a better way to find "closely matching" numbers than Levenshtein similarity score).
Tune threshold to a lower value like 0.6.
Add param join='left-outer'.

Full example:

Master_data_df = pd.DataFrame({
    'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell', 'China State Construction Engineering', 'CHINA STATE CONSTRUCTION ENGINEERING'],
    'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]
    })

My_records_df = pd.DataFrame({
    'Client': ['Rayal Duch Shel', 'China National Petrolium', 'Arcellor Mittal', 'China State Constrn Engg'],
    'Value': [23455.98, 98426.32 , 45393.62, 35672.15]
    })

# Cast float to string to be used for "levenshtein" distance calculation.
Master_data_df["Transaction_Value"] = Master_data_df["Transaction_Value"].astype(str)
My_records_df["Value"] = My_records_df["Value"].astype(str)

import pandas as pd
import fuzzy_pandas as fpd
Final_Report_DF = fpd.fuzzy_merge(Master_data_df, My_records_df,
                        left_on=['Client_Name', 'Transaction_Value'],
                        right_on=['Client', 'Value'],
                        method='levenshtein',
                        threshold=0.6,
                        join='left-outer')

Output:

                            Client_Name Transaction_Value                    Client     Value
0                     Royal DUTCH Shell          23455.25           Rayal Duch Shel  23455.98
1  China State Construction Engineering          35672.76  China State Constrn Engg  35672.15
2                     Royal Dutch Shell           6782.67                                    
3  CHINA STATE CONSTRUCTION ENGINEERING           1000.15                                    ```

Fuzzy Matching 2 DataFrames on multiple columns which includes one column with Float Values

Answers (1)

Related Questions