Reputation: 123
I need to be able to compare two dataframes, one with one column, and one with two columns, like this:
import numpy as np
import pandas as pd
df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))
df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30
Now, I want to compare df_1['A'] and df_2['X'] to find matching values, and then create a second column in df_1 (aka df_1['B']) with a value from df_2['Y'] that corresponds to the matching df_2['X'] value. Does anyone have a solution?
If there isn't an exact matching value between the first two columns of the dataframes, is there a way to match the next closest value (with a threshold of ~5%)?
Upvotes: 0
Views: 329
Reputation: 4618
As the OP mentions, you may also want to fall back to the closest value in df_2['X'] when a value in df_1['A'] has no exact match. To do this, you can try the following:
Define your dfs as per the OP:
df_1 = pd.DataFrame(columns=list('AB'))
df_1['A'] = np.random.randint(00,99,size=(5))
df_2 = pd.DataFrame(columns=list('XY'))
df_2['X'] = np.arange(0,100,0.1)
df_2['Y'] = np.cos(df_2['X']) + 30
First define a function which will find the closest value:
import numpy as np
def find_nearest(df, in_col, value, out_col):
    # df      = input df (df_2 here)
    # in_col  = column to match against ('X' here)
    # value   = value to look for in in_col (values in df_1['A'] here)
    # out_col = column with the data you want ('Y' here)
    array = np.asarray(df[in_col])
    idx = (np.abs(array - value)).argmin()  # position of the closest value
    return df.iloc[idx][out_col]
Then get all the df_2['Y'] values you want:
matching_vals = []  # values from df_2['Y'] to add to df_1['B']
for A in df_1['A'].values:  # loop through all df_1['A'] values
    # compare against the column values; "A in df_2['X']" would test the index labels instead
    exact = df_2.loc[df_2['X'] == A, 'Y']
    if not exact.empty:  # exact match
        matching_vals.append(float(exact.iloc[0]))  # append corresponding df_2['Y'] value
    else:  # no exact match
        matching_vals.append(find_nearest(df_2, 'X', A, 'Y'))  # append df_2['Y'] value with closest df_2['X']
Finally, add it to the original df_1:
df_1['B']=matching_vals
This example works for the dfs that you have provided, but you may have to adjust the steps slightly to work with your real data.
You can also add one more if statement to enforce the 5% threshold, and append NaN to the list (or whatever works best for you) when the closest value does not pass.
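As a rough sketch of that extra check, assuming "5%" means the closest df_2['X'] value must lie within 5% of the value being looked up: the function name nearest_within and the rel_tol parameter are made up for this illustration, and the df_1 values are fixed rather than random so the result is reproducible.

```python
import numpy as np
import pandas as pd

def nearest_within(df, in_col, value, out_col, rel_tol=0.05):
    """Return df[out_col] from the row whose df[in_col] is closest to value,
    or NaN if that closest value is more than rel_tol * |value| away."""
    array = np.asarray(df[in_col], dtype=float)
    idx = np.abs(array - value).argmin()  # position of the closest value
    if abs(array[idx] - value) > rel_tol * abs(value):
        return np.nan  # closest candidate is outside the threshold
    return df[out_col].iloc[idx]

df_2 = pd.DataFrame({'X': np.arange(0, 100, 0.1)})
df_2['Y'] = np.cos(df_2['X']) + 30

# 500 is far outside the range of df_2['X'], so it should come back as NaN
df_1 = pd.DataFrame({'A': [3, 50, 500]})
df_1['B'] = [nearest_within(df_2, 'X', a, 'Y') for a in df_1['A']]
```

Because an exact match has a distance of zero, the nearest-value lookup also covers the exact-match case, so a separate branch for it is optional here.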
Upvotes: 1
Reputation: 3108
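df_2.merge(df_1[['A']], left_on=['X'], right_on=['A']).rename({'Y': 'B'}, axis='columns')
The merge keeps only the rows where df_1['A'] and df_2['X'] share a value, and the rename then turns 'Y' into 'B'. Selecting df_1[['A']] leaves out the empty 'B' column of df_1, so the result does not end up with two 'B' columns.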
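One caveat with an exact merge: np.arange(0, 100, 0.1) produces floats, and some of them are not exactly equal to the integer they appear to be, so a few rows may silently fail to match. If that matters, a nearest-key merge is one possible alternative. The sketch below uses pd.merge_asof with direction='nearest'; the tolerance of 0.05 is an absolute distance chosen only for illustration (not the relative 5% from the question), and the df_1 values are fixed rather than random.

```python
import numpy as np
import pandas as pd

df_1 = pd.DataFrame({'A': [50, 3, 20]})
df_2 = pd.DataFrame({'X': np.arange(0, 100, 0.1)})
df_2['Y'] = np.cos(df_2['X']) + 30

# merge_asof needs both keys sorted and of the same dtype,
# so cast 'A' to float and sort it (df_2['X'] is already sorted)
left = df_1[['A']].astype(float).sort_values('A')

result = pd.merge_asof(
    left, df_2,
    left_on='A', right_on='X',
    direction='nearest',  # take the closest 'X', above or below
    tolerance=0.05,       # assumed absolute cutoff; no match within it gives NaN
)
result = result.rename(columns={'Y': 'B'}).drop(columns='X')
```

Note that result is ordered by 'A' rather than by the original row order of df_1.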
Upvotes: 0