Reputation: 67
I have 2 DataFrames namely 'Master_data_df' & 'My_records_df'. I am required to find out records which are missed out from 'Master_data_df' by comparing with 'My_records_df'.
Here the column 'Cleint_Name' is a string and there is no exact match in 2 dataframes. Similarly, the column 'Transaction_Value' is a float and again the values varies slightly.
Master_data_df = pd.DataFrame({'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell' ,
'China State Construction Engineering', 'CHINA STATE CONSTRUCTION
ENGINEERING'],
'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]})
My_records_df = pd.DataFrame({'Client': ['Rayal Duch Shel', 'China National Petrolium',
'Arcellor Mittal' , 'China State Constrn Engg'],
'Value': [23455.98, 98426.32 , 45393.62, 35672.15})
I am looking for output as below. The Row with NaN value gives me records which are missing in My_records_df.
I have tried below:
import pandas as pd
import fuzzy_pandas as fpd
Final_Report_DF = fpd.fuzzy_merge(RMaster_data_df , My_records_df,
left_on= ['Client_Name' , 'Transaction_Value'],
right_on= ['Client' , 'Value'],
method= 'levenshtein',
threshold= 0.8)
This gives me a blank dataframe. (Note: fpd.fuzzy_merge does not allow me to use option 'join')
Request guidance in solving this.
Upvotes: 1
Views: 1016
Reputation: 4098
You have to fix three things in above example:
threshold
to a lower value like 0.6
.join='left-outer'
.Full example:
Master_data_df = pd.DataFrame({
'Client_Name': ['Royal DUTCH Shell', 'Royal Dutch Shell', 'China State Construction Engineering', 'CHINA STATE CONSTRUCTION ENGINEERING'],
'Transaction_Value': [23455.25, 6782.67, 35672.76 , 1000.15]
})
My_records_df = pd.DataFrame({
'Client': ['Rayal Duch Shel', 'China National Petrolium', 'Arcellor Mittal', 'China State Constrn Engg'],
'Value': [23455.98, 98426.32 , 45393.62, 35672.15]
})
# Cast float to string to be used for "levenshtein" distance calculation.
Master_data_df["Transaction_Value"] = Master_data_df["Transaction_Value"].astype(str)
My_records_df["Value"] = My_records_df["Value"].astype(str)
import pandas as pd
import fuzzy_pandas as fpd
Final_Report_DF = fpd.fuzzy_merge(Master_data_df, My_records_df,
left_on=['Client_Name', 'Transaction_Value'],
right_on=['Client', 'Value'],
method='levenshtein',
threshold=0.6,
join='left-outer')
Output:
Client_Name Transaction_Value Client Value
0 Royal DUTCH Shell 23455.25 Rayal Duch Shel 23455.98
1 China State Construction Engineering 35672.76 China State Constrn Engg 35672.15
2 Royal Dutch Shell 6782.67
3 CHINA STATE CONSTRUCTION ENGINEERING 1000.15 ```
Upvotes: 1