hk_03
hk_03

Reputation: 311

New column in DataFrame based on conditions

I have a DataFrame like:

+------------+---------------+-------------+---------------------+-------------------+
| SK_ID_CURR | CREDIT_ACTIVE | DAYS_CREDIT | DAYS_CREDIT_ENDDATE | DAYS_ENDDATE_FACT |
+------------+---------------+-------------+---------------------+-------------------+
|     436084 | Sold          |       -2835 | -2094.0             | -2436.0           |
|     436084 | Active        |        -987 | -438.0              | NaN               |
|     436084 | Sold          |       -1875 | -1494.0             | -1494.0           |
|     436084 | Active        |       -1135 | -951.0              | NaN               |
|     436084 | Bad debt      |        -986 | NaN                 | NaN               |
|     436084 | Active        |        -968 | -845.0              | NaN               |
|     436084 | Active        |        -987 | -803.0              | NaN               |
+------------+---------------+-------------+---------------------+-------------------+

I like to create a new column CREDIT_LENGTH_IN_DAYS with the following rules:

def func(x):
    if x[x['CREDIT_ACTIVE'] == 'Active']:
    return  x['DAYS_CREDIT_ENDDATE'] - x['DAYS_CREDIT']
    elif x[x['CREDIT_ACTIVE'] == 'Closed'] | x[x['CREDIT_ACTIVE'] == 'Sold'] :
    return x['DAYS_ENDDATE_FACT'] - x['DAYS_CREDIT']
    elif x[x['CREDIT_ACTIVE'] == 'Bad debt']:
    return x['DAYS_CREDIT']

Then i use:

df_bureau['CREDIT_LENGTH_IN_DAYS'] = df_bureau.apply(func, axis=1)

However when the case is x[x['CREDIT_ACTIVE']=='Bad debt' I get interesting values, not the actual value for each row in x['DAYS_CREDIT'].

Upvotes: 1

Views: 77

Answers (1)

jezrael
jezrael

Reputation: 862431

Use numpy.select:

m1 = df_bureau['CREDIT_ACTIVE'] == 'Active'
m2 = df_bureau['CREDIT_ACTIVE'].isin(['Closed','Sold'])
m3 = df_bureau['CREDIT_ACTIVE'] == 'Bad debt'

v1 = df_bureau['DAYS_CREDIT_ENDDATE'] - df_bureau['DAYS_CREDIT']
v2 = df_bureau['DAYS_ENDDATE_FACT'] - df_bureau['DAYS_CREDIT']
v3 = df_bureau['DAYS_CREDIT']

df_bureau['CREDIT_LENGTH_IN_DAYS'] = np.select([m1, m2, m3], [v1, v2, v3], np.nan)
print (df_bureau)
   SK_ID_CURR CREDIT_ACTIVE  DAYS_CREDIT  DAYS_CREDIT_ENDDATE  \
0      436084          Sold        -2835              -2094.0   
1      436084        Active         -987               -438.0   
2      436084          Sold        -1875              -1494.0   
3      436084        Active        -1135               -951.0   
4      436084      Bad debt         -986                  NaN   
5      436084        Active         -968               -845.0   
6      436084        Active         -987               -803.0   

   DAYS_ENDDATE_FACT  CREDIT_LENGTH_IN_DAYS  
0            -2436.0                  399.0  
1                NaN                  549.0  
2            -1494.0                  381.0  
3                NaN                  184.0  
4                NaN                 -986.0  
5                NaN                  123.0  
6                NaN                  184.0  

Your solution working with each row separately, so filtering is not necessary, also need change | to or because working with scalars:

def func(x):
    if x['CREDIT_ACTIVE'] == 'Active':
        return  x['DAYS_CREDIT_ENDDATE'] - x['DAYS_CREDIT']
    elif (x['CREDIT_ACTIVE'] == 'Closed') or (x['CREDIT_ACTIVE'] == 'Sold'):
        return x['DAYS_ENDDATE_FACT'] - x['DAYS_CREDIT']
    elif x['CREDIT_ACTIVE'] == 'Bad debt':
        return x['DAYS_CREDIT']

df_bureau['CREDIT_LENGTH_IN_DAYS'] = df_bureau.apply(func, axis=1)

Upvotes: 2

Related Questions