Compare two dataframes, and then add new column to one of the data frames based on the other

Question

I need to be able to compare two dataframes, one with one column, and one with two columns, like this:

import numpy as np
import pandas as pd

df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))

df_2  = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30

Now, I want to compare df_1['A'] and df_2['X'] to find matching values, and then create a second column in df_1 (aka df_1['B']) with a value from df_2['Y'] that corresponds to the matching df_2['X'] value. Does anyone have a solution?

If there isn't an exact matching value between the first two columns of the dataframes, is there a way to match the next closest value (with a threshold of ~5%)?

Derek Eden · Accepted Answer

As the question mentions, you may also want to fall back to the closest value in df_2['X'] when a value in df_1['A'] has no exact match. One way to do that is as follows.

Define your DataFrames as in the question:

df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))

df_2  = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30 #changed "line_x"

first define a function which will find the closest value:

import numpy as np

def find_nearest(df, in_col, value, out_col):
    # df: frame to search (df_2 here); in_col: column to match against ('X' here)
    # value: value to look up (each df_1['A'] value here); out_col: column to return ('Y' here)
    array = np.asarray(df[in_col])
    idx = (np.abs(array - value)).argmin()  # position of the closest in_col value
    return df.iloc[idx][out_col]
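
As a quick sanity check, the helper can be tried on a tiny hand-made frame. The `demo` frame and its values below are made up for illustration and are not part of the question's data:

```python
import numpy as np
import pandas as pd

def find_nearest(df, in_col, value, out_col):
    array = np.asarray(df[in_col])
    idx = (np.abs(array - value)).argmin()
    return df.iloc[idx][out_col]

# hypothetical lookup table: X is matched against, Y is returned
demo = pd.DataFrame({'X': [0.0, 1.0, 2.0, 3.0], 'Y': [10.0, 11.0, 12.0, 13.0]})

# 1.9 has no exact match; the closest X is 2.0, so the Y from that row comes back
nearest_y = find_nearest(demo, 'X', 1.9, 'Y')
print(nearest_y)  # 12.0
```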

then get all the df_2['Y'] values you want:

matching_vals = []  # df_2['Y'] values to put in df_1['B']
for A in df_1['A'].values:  # loop through all df_1['A'] values
    exact = df_2.loc[df_2['X'] == A, 'Y']  # rows where X equals A exactly
    if not exact.empty:  # exact match
        matching_vals.append(float(exact.iloc[0]))  # take the corresponding df_2['Y'] value
    else:  # no exact match (common here: np.arange with a 0.1 step is not exact in floating point)
        matching_vals.append(find_nearest(df_2, 'X', A, 'Y'))  # df_2['Y'] value at the closest df_2['X']

finally, add it to the original df_1:

df_1['B']=matching_vals
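
For larger frames, the loop can probably be replaced by a single nearest-key join. The following is only a sketch of that alternative, not the approach used above: it relies on `pd.merge_asof` with `direction='nearest'`, and it uses fixed values for df_1['A'] (an assumption, so the result is reproducible) instead of random ones.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'A': [3, 47, 12, 98, 60]})  # fixed stand-ins for the random values
df_2 = pd.DataFrame({'X': np.arange(0, 100, 0.1)})
df_2['Y'] = np.cos(df_2['X']) + 30

# merge_asof needs both keys sorted and of the same dtype, so cast A to float
# and keep the original row labels in an 'index' column while sorting
left = df_1.assign(A=df_1['A'].astype(float)).reset_index().sort_values('A')
matched = pd.merge_asof(left, df_2, left_on='A', right_on='X', direction='nearest')

# put the matched Y values back in df_1's original row order
df_1['B'] = matched.set_index('index')['Y'].reindex(df_1.index).to_numpy()
```

Note that the `tolerance` argument of `merge_asof` is an absolute distance, so a percentage threshold would still need a separate check afterwards.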

This works for the DataFrames in your example, but you may need to adjust the steps for your real data.

You can also add one more if statement to enforce the 5% threshold: if the closest match falls outside it, append NaN to the list instead (or whatever works best for you).
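
One possible way to fold the threshold into the helper is sketched below. It assumes that "5%" means the closest df_2['X'] value may differ from the lookup value by at most 5% of the lookup value; the question does not pin this down. The name `find_nearest_within` and the `rel_tol` parameter are made up for this sketch.

```python
import numpy as np
import pandas as pd

def find_nearest_within(df, in_col, value, out_col, rel_tol=0.05):
    """Return out_col from the row whose in_col is closest to value,
    or NaN if that closest in_col differs by more than rel_tol * |value|."""
    array = np.asarray(df[in_col], dtype=float)
    idx = np.abs(array - value).argmin()
    if abs(array[idx] - value) > rel_tol * abs(value):
        return np.nan  # closest candidate is outside the threshold
    return df[out_col].iloc[idx]

# hypothetical lookup table for illustration
demo = pd.DataFrame({'X': [10.0, 20.0, 30.0], 'Y': [1.0, 2.0, 3.0]})

close_hit = find_nearest_within(demo, 'X', 20.5, 'Y')  # 0.5 away, within 5% of 20.5
too_far = find_nearest_within(demo, 'X', 25.0, 'Y')    # 5.0 away, outside 5% of 25.0
print(close_hit, too_far)
```

With a relative threshold like this, a lookup value of 0 only accepts an exact match, so an absolute tolerance may suit data near zero better.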
