mnickey
mnickey

Reputation: 787

replacing data frame values with conditions

What I'm trying to accomplish is a conditional check that looks at the value in the amount column. If the value in df['amount'] is less than 200k, nothing is done. If the value in df['amount'] is greater than or equal to 200k, replace the corresponding with either the value in months or 36.00 depending on whichever is greater.

import pandas as pd
df['amount'] = [332374.00, 22250.75, 45282.10, 339720.00, 1100.00, 40000.00, 15000.00, 207820.00, 497432.00]
df['months'] = [18.00, 17.00, 16.00, 46.00, 14.00, 13.00, 13.00, 13.00, 12.00]
df['checks'] = [1.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 1.00]

The desired output is a dataset that looks like this.

amount      months  checks
332374.00   36.00   1.00
22250.75    17.00   0.00
45282.10    16.00   0.00
339720.00   46.00   1.00
1100.00     14.00   0.00
40000.00    13.00   0.00
15000.00    13.00   0.00
207820.00   36.00   1.00
497432.00   36.00   1.00

Upvotes: 1

Views: 49

Answers (2)

Alexander
Alexander

Reputation: 109546

You can just use loc to find instances where the amount exceeds your 200k threshold and where months is less than 36. Then just set those values to 36.

df.loc[(df['amount'] >= 200000) & (df['months'] < 36), 'months'] = 36
>>> df
      amount  months  checks
0  332374.00      36       1
1   22250.75      17       0
2   45282.10      16       0
3  339720.00      46       1
4    1100.00      14       0
5   40000.00      13       0
6   15000.00      13       0
7  207820.00      36       1
8  497432.00      36       1

Timings

Using assign will be slower because you have to copy the entire dataframe and then assign the new column. Using loc will be faster.

Here are the timings on this small dataset:

%timeit df.assign(months=df.months.mask(df.amount.ge(2E5), np.maximum(df.months, 36)))
# 1000 loops, best of 3: 1.01 ms per loop

%timeit df.loc[(df['amount'] >= 200000) & (df['months'] < 36), 'months'] = 36
# 1000 loops, best of 3: 838 µs per loop

Upvotes: 3

piRSquared
piRSquared

Reputation: 294338

pd.Series.mask allows you to mask values of the series in the position of the True values passed in the cond argument. You can also pass an other argument to supply the value to replace the ones being masked. I used np.maximum to generate the other values.

df.assign(months=df.months.mask(df.amount.ge(2E5), np.maximum(df.months, 36)))

      amount  months  checks
0  332374.00    36.0     1.0
1   22250.75    17.0     0.0
2   45282.10    16.0     0.0
3  339720.00    46.0     1.0
4    1100.00    14.0     0.0
5   40000.00    13.0     0.0
6   15000.00    13.0     0.0
7  207820.00    36.0     1.0
8  497432.00    36.0     1.0

See also pd.Series.where as we could have performed the same task with:

df.assign(months=df.months.where(df.amount.lt(2E5), np.maximum(df.months, 36)))

      amount  months  checks
0  332374.00    36.0     1.0
1   22250.75    17.0     0.0
2   45282.10    16.0     0.0
3  339720.00    46.0     1.0
4    1100.00    14.0     0.0
5   40000.00    13.0     0.0
6   15000.00    13.0     0.0
7  207820.00    36.0     1.0
8  497432.00    36.0     1.0

We can also perform this task using the same logic but utilizing numpy.where

a = df.amount.values
m = df.months.values
df.assign(months=np.where(a < 2E5, m, np.maximum(m, 36)))

      amount  months  checks
0  332374.00    36.0     1.0
1   22250.75    17.0     0.0
2   45282.10    16.0     0.0
3  339720.00    46.0     1.0
4    1100.00    14.0     0.0
5   40000.00    13.0     0.0
6   15000.00    13.0     0.0
7  207820.00    36.0     1.0
8  497432.00    36.0     1.0

Upvotes: 3

Related Questions