Reputation: 17631
I have the following pandas DataFrame.
import pandas as pd
df = pd.read_csv('filename.csv')
print(df)
dog A B C
0 dog1 0.787575 0.159330 0.053095
1 dog10 0.770698 0.169487 0.059815
2 dog11 0.792689 0.152043 0.055268
3 dog12 0.785066 0.160361 0.054573
4 dog13 0.795455 0.150464 0.054081
5 dog14 0.794873 0.150700 0.054426
.. ....
8 dog19 0.811585 0.140207 0.048208
9 dog2 0.797202 0.152033 0.050765
10 dog20 0.801607 0.145137 0.053256
11 dog21 0.792689 0.152043 0.055268
....
I create a new column by summing columns "A"
, "B"
, "C"
as follows:
df['total_ABC'] = df[["A", "B", "B"]].sum(axis=1)
Now I would like to do this based on a conditional, i.e. if "A" < 0.78
then create a new summed column df['smallA_sum'] = df[["A", "B", "B"]].sum(axis=1)
. Otherwise, the value should be zero.
How does one create conditional statements like this?
My thought would be to use
df['smallA_sum'] = df1.apply(lambda row: (row['A']+row['B']+row['C']) if row['A'] < 0.78))
However, this doesn't work and I'm not able to specify axis.
How do you create a column based on the values of other columns?
You could also do something like for each df['dog'] == 'dog2'
, create column dog2_sum
, i.e.
df['dog2_sum'] = df1.apply(lambda row: (row['A']+row['B']+row['C']) if df['dog'] == 'dog2'))
but my approach is incorrect.
Upvotes: 27
Views: 133350
Reputation: 23081
Another approach is to use numpy.where()
method to select values. It returns elements chosen from the sum result if the condition is met, 0 otherwise. Due to a lower overhead, numpy methods are usually faster than their pandas cousins. Barring numba-jitted or Cython loops, this is the fastest approach for this specific task.
import numpy as np
df['Total'] = np.where(df['A'] < 0.78, df[['A','B','C']].sum(axis=1), 0)
or
df['total'] = np.where(df['dog'] == 'dog2', df[['A','B','C']].sum(axis=1), 0)
Upvotes: 1
Reputation: 394011
The following should work, here we mask the df where the condition is met, this will set NaN
to the rows where the condition isn't met so we call fillna
on the new col:
In [67]:
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df
Out[67]:
A B C
0 0.197334 0.707852 -0.443475
1 -1.063765 -0.914877 1.585882
2 0.899477 1.064308 1.426789
3 -0.556486 -0.150080 -0.149494
4 -0.035858 0.777523 -0.453747
In [73]:
df['total'] = df.loc[df['A'] > 0,['A','B']].sum(axis=1)
df['total'].fillna(0, inplace=True)
df
Out[73]:
A B C total
0 0.197334 0.707852 -0.443475 0.905186
1 -1.063765 -0.914877 1.585882 0.000000
2 0.899477 1.064308 1.426789 1.963785
3 -0.556486 -0.150080 -0.149494 0.000000
4 -0.035858 0.777523 -0.453747 0.000000
Another approach is to call where
on the sum
result, this takes a value param to return when the condition isn't met:
In [75]:
df['total'] = df[['A','B']].sum(axis=1).where(df['A'] > 0, 0)
df
Out[75]:
A B C total
0 0.197334 0.707852 -0.443475 0.905186
1 -1.063765 -0.914877 1.585882 0.000000
2 0.899477 1.064308 1.426789 1.963785
3 -0.556486 -0.150080 -0.149494 0.000000
4 -0.035858 0.777523 -0.453747 0.000000
Upvotes: 36