Reputation: 693
I would like to create a new column with a numerical value based on the following conditions:
a. if gender is male & pet1==pet2, points = 5
b. if gender is female & (pet1 is 'cat' or pet1 is 'dog'), points = 5
c. all other combinations, points = 0
gender pet1 pet2
0 male dog dog
1 male cat cat
2 male dog cat
3 female cat squirrel
4 female dog dog
5 female squirrel cat
6 squirrel dog cat
I would like the end result to be as follows:
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
How do I accomplish this?
Upvotes: 61
Views: 139811
Reputation: 23031
Writing the conditions as a string expression and evaluating it using eval()
is another method to evaluate the condition and assign values to the column using numpy.where()
.
# evaluate the condition
condition = df.eval("gender=='male' and pet1==pet2 or gender=='female' and pet1==['cat','dog']")
# assign values
df['points'] = np.where(condition, 5, 0)
If you have a large dataframe (100k+ rows) and a lot of comparisons to evaluate, this method is probably the fastest pandas method to construct a boolean mask.1
Another advantage of this method over chained &
and/or |
operators (used in the other vectorized answers here) is better readability (arguably).
1: For a dataframe with 105k rows, if you evaluate 4 conditions where each chain two comparisons, eval()
creates a boolean mask substantially faster than chaining bitwise operators.
df = pd.DataFrame([{'gender': 'male', 'pet1': 'dog', 'pet2': 'dog'}, {'gender': 'male', 'pet1': 'cat', 'pet2': 'cat'}, {'gender': 'male', 'pet1': 'dog', 'pet2': 'cat'},{'gender': 'female', 'pet1': 'cat', 'pet2': 'squirrel'},{'gender': 'female', 'pet1': 'dog', 'pet2': 'dog'},{'gender': 'female', 'pet1': 'squirrel', 'pet2': 'cat'},{'gender': 'squirrel', 'pet1': 'dog', 'pet2': 'cat'}]*15_000)
%timeit np.where(df.eval("gender == 'male' and pet1 == pet2 or gender == 'female' and pet1 == ['cat','dog'] or gender == 'female' and pet2 == ['squirrel','dog'] or pet1 == 'cat' and pet2 == 'cat'"), 5, 0)
# 37.9 ms ± 847 µs per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ) | ( (df['gender'] == 'female') & (df['pet2'].isin(['squirrel','dog'] ) ) ) | ( (df['pet1'] == 'cat') & (df['pet2'] == 'cat') ), 5, 0)
# 53.5 ms ± 1.38 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
%timeit np.select([df['gender'].eq('male') & df['pet1'].eq(df['pet2']), df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog']), df['gender'].eq('female') & df['pet2'].isin(['squirrel', 'dog']), df['pet1'].eq('cat') & df['pet2'].eq('cat')], [5,5,5,5], default=0)
# 48.9 ms ± 5.06 ms per loop (mean ± std. dev. of 10 runs, 100 loops each)
Upvotes: 4
Reputation: 28644
One option is with case_when from pyjanitor; under the hood it uses pd.Series.mask
.
The basic idea is a pairing of condition and expected value; you can pass as many pairings as required, followed by a default value and a target column name:
# pip install pyjanitor
import pandas as pd
import janitor
df.case_when(
# condition, value
df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
0, # default
column_name = 'points')
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
You could use strings for the conditions, as long as they can be evaluated by pd.eval
on the parent dataframe - note that speed wise, this can be slower for small datasets:
df.case_when(
"gender == 'male' and pet1 == pet2", 5,
"gender == 'female' and pet2 == ['cat', 'dog']", 5,
0,
column_name = 'points')
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 0
4 female dog dog 5
5 female squirrel cat 5
6 squirrel dog cat 0
Anonymous functions are also possible, which can be handy in chained operations:
df.case_when(
lambda df: df.gender.eq('male') & df.pet1.eq(df.pet2), 5,
lambda df: df.gender.eq('female') & df.pet1.isin(['cat', 'dog']), 5,
0, # default
column_name = 'points')
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
Upvotes: 0
Reputation: 42886
numpy.select
This is a perfect case for np.select
where we can create a column based on multiple conditions and it's a readable method when there are more conditions:
conditions = [
df['gender'].eq('male') & df['pet1'].eq(df['pet2']),
df['gender'].eq('female') & df['pet1'].isin(['cat', 'dog'])
]
choices = [5,5]
df['points'] = np.select(conditions, choices, default=0)
print(df)
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
Upvotes: 76
Reputation: 1822
You can also use the apply
function. For example:
def myfunc(gender, pet1, pet2):
if gender=='male' and pet1==pet2:
myvalue=5
elif gender=='female' and (pet1=='cat' or pet1=='dog'):
myvalue=5
else:
myvalue=0
return myvalue
And then using the apply function by setting axis=1
df['points'] = df.apply(lambda x: myfunc(x['gender'], x['pet1'], x['pet2']), axis=1)
We get:
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
Upvotes: 7
Reputation: 845
The apply method described by @RuggeroTurra takes a lot longer for 500k rows. I ended up using something like
df['result'] = ((df.a == 0) & (df.b != 1)).astype(int) * 2 + \
((df.a != 0) & (df.b != 1)).astype(int) * 3 + \
((df.a == 0) & (df.b == 1)).astype(int) * 4 + \
((df.a != 0) & (df.b == 1)).astype(int) * 5
where the apply method took 25 seconds and this method above took about 18ms.
Upvotes: 6
Reputation: 393923
You can do this using np.where
, the conditions use bitwise &
and |
for and
and or
with parentheses around the multiple conditions due to operator precedence. So where the condition is true 5
is returned and 0
otherwise:
In [29]:
df['points'] = np.where( ( (df['gender'] == 'male') & (df['pet1'] == df['pet2'] ) ) | ( (df['gender'] == 'female') & (df['pet1'].isin(['cat','dog'] ) ) ), 5, 0)
df
Out[29]:
gender pet1 pet2 points
0 male dog dog 5
1 male cat cat 5
2 male dog cat 0
3 female cat squirrel 5
4 female dog dog 5
5 female squirrel cat 0
6 squirrel dog cat 0
Upvotes: 55
Reputation: 17670
using apply.
def f(x):
if x['gender'] == 'male' and x['pet1'] == x['pet2']: return 5
elif x['gender'] == 'female' and (x['pet1'] == 'cat' or x['pet1'] == 'dog'): return 5
else: return 0
data['points'] = data.apply(f, axis=1)
Upvotes: 30