Ollie Perkins
Ollie Perkins

Reputation: 343

Match pandas value on all and row specific values

I have two large data frames, and I want to match the first against two criteria in the second.

However, I want the first criteria to match against all values in the relevant column of the second, whilst I want the second criteria to be pulled from the specific row of the second and matched to the specific value of the first.

I don't want to overwrite all values from the second into the first, only when they meet specific criteria. In the example below, I would like to overwrite z's in df1['output'] but the x's to be left. (This makes sense in my more complex case.)

I can't figure out how to apply this to the whole dataframe, as of course python spits out an error when trying to compare one value to a series.

Here is my dummy code:

import pandas as pd
import numpy as np


criteria1 = np.random.normal(size = 100)
criteria2 = ['y', 'n'] * 50
output = ['z', 'x'] * 50

d = {'criteria1': criteria1, 'criteria2': criteria2}
d1 = {'criteria1': criteria1, 'output': output}
df1 = pd.DataFrame(d1, columns = ['criteria1', 'output'])
df2 = pd.DataFrame(d, columns = ['criteria1', 'criteria2'])

def dbl_match(x, y):
    if ((x['criteria1'] in y['criteria1'].values) & (y['criteria2'] == 'y')):
        temp = 'hit'
    else:
        temp = x['output']
    return(temp)

df1['output'] = df1.apply(dbl_match, 1, y = df2)

EDIT: To note that the two dataframes in my actual case are not the same length.

Upvotes: 0

Views: 103

Answers (1)

It_is_Chris
It_is_Chris

Reputation: 14083

Try using np.select

conditions = [
((df1['criteria1'] in df2['criteria1'].values) & (df2['criteria2'] == 'y'))

]
choices = [
'hit',
]

df1['output'] = np.select(conditions, choices, default = df1['output'])

output:

    criteria1   output
0   0.126479    hit
1   1.095249    x
2   1.216341    hit
3   -0.935278   x
4   0.088585    hit

New code per comment:

import pandas as pd
import numpy as np


criteria1 = np.random.normal(size = 100)
criteria2 = ['y', 'n'] * 50
output = ['z', 'x'] * 50

d = {'criteria1': criteria1, 'criteria2': criteria2}
d1 = {'criteria1': criteria1, 'output': output}
df1 = pd.DataFrame(d1, columns = ['criteria1', 'output'])
df2 = pd.DataFrame(d, columns = ['criteria1', 'criteria2'])

split_point = len(df2)-90
df2 = df2[0:split_point] #length of df2 is now only 10



conditions = [
((df1['criteria1'].isin(df2['criteria1'].values)) & (df2['criteria2'] == 'y'))

]
choices = [
'hit',
]

df1['output'] = np.select(conditions, choices, default = df1['output'])
df1

new output:

    criteria1   output
0   0.335414    hit
1   -1.735475   x
2   0.003568    hit
3   -1.212202   x
4   1.007000    hit
5   -0.934410   x
6   -1.767913   hit
7   -0.966843   x
8   0.110457    hit
9   -1.426420   x
10  -0.617043   z
11  0.771105    x
12  0.483416    z
13  0.501804    x
14  -1.791317   z

Upvotes: 1

Related Questions