Reputation: 343
I have two large data frames, and I want to match the first against two criteria in the second.
However, I want the first criteria to match against all values in the relevant column of the second, whilst I want the second criteria to be pulled from the specific row of the second and matched to the specific value of the first.
I don't want to overwrite all values from the second into the first, only when they meet specific criteria. In the example below, I would like to overwrite z's in df1['output'] but the x's to be left. (This makes sense in my more complex case.)
I can't figure out how to apply this to the whole dataframe, as of course python spits out an error when trying to compare one value to a series.
Here is my dummy code:
import pandas as pd
import numpy as np
criteria1 = np.random.normal(size = 100)
criteria2 = ['y', 'n'] * 50
output = ['z', 'x'] * 50
d = {'criteria1': criteria1, 'criteria2': criteria2}
d1 = {'criteria1': criteria1, 'output': output}
df1 = pd.DataFrame(d1, columns = ['criteria1', 'output'])
df2 = pd.DataFrame(d, columns = ['criteria1', 'criteria2'])
def dbl_match(x, y):
if ((x['criteria1'] in y['criteria1'].values) & (y['criteria2'] == 'y')):
temp = 'hit'
else:
temp = x['output']
return(temp)
df1['output'] = df1.apply(dbl_match, 1, y = df2)
EDIT: To note that the two dataframes in my actual case are not the same length.
Upvotes: 0
Views: 103
Reputation: 14083
Try using np.select
conditions = [
((df1['criteria1'] in df2['criteria1'].values) & (df2['criteria2'] == 'y'))
]
choices = [
'hit',
]
df1['output'] = np.select(conditions, choices, default = df1['output'])
output:
criteria1 output
0 0.126479 hit
1 1.095249 x
2 1.216341 hit
3 -0.935278 x
4 0.088585 hit
New code per comment:
import pandas as pd
import numpy as np
criteria1 = np.random.normal(size = 100)
criteria2 = ['y', 'n'] * 50
output = ['z', 'x'] * 50
d = {'criteria1': criteria1, 'criteria2': criteria2}
d1 = {'criteria1': criteria1, 'output': output}
df1 = pd.DataFrame(d1, columns = ['criteria1', 'output'])
df2 = pd.DataFrame(d, columns = ['criteria1', 'criteria2'])
split_point = len(df2)-90
df2 = df2[0:split_point] #length of df2 is now only 10
conditions = [
((df1['criteria1'].isin(df2['criteria1'].values)) & (df2['criteria2'] == 'y'))
]
choices = [
'hit',
]
df1['output'] = np.select(conditions, choices, default = df1['output'])
df1
new output:
criteria1 output
0 0.335414 hit
1 -1.735475 x
2 0.003568 hit
3 -1.212202 x
4 1.007000 hit
5 -0.934410 x
6 -1.767913 hit
7 -0.966843 x
8 0.110457 hit
9 -1.426420 x
10 -0.617043 z
11 0.771105 x
12 0.483416 z
13 0.501804 x
14 -1.791317 z
Upvotes: 1