plyusha
plyusha

Reputation: 33

Creating a column based on the columns which may contain Nan values in pandas

I have 3 columns which should be weighted and summed. However sometimes there are Nan values in these colums and this influences the final set of columns which are being weighted and summed. Further the example df:

import numpy as np
import pandas as pd

f = { 'A': [1, np.nan, 2, np.nan, 5, 6, np.nan],
'B': [np.nan, np.nan, 1, 1, 1, np.nan, 7], 
'C': [np.nan, 2, 3, 6, np.nan, 5, np.nan]}
fd = pd.DataFrame(data = f)
fd.head(10)

      A  B   C
0   1.0 NaN NaN
1   NaN NaN 2.0
2   2.0 1.0 3.0
3   NaN 1.0 6.0
4   5.0 1.0 NaN
5   6.0 NaN 5.0
6   NaN 7.0 NaN

This example demonstrates all the possible combinations of Nan in columns. Then I would like to create column F, which is a weighted sum of columns A, B and C when they are not Nan. Here is my code:

def scaler(df):
"Scaling and summing"
if (pd.notnull(df['A']) == True & pd.notnull(df['B']) == True & pd.notnull(df['C']) == True):
    return df['A']*0.5+df['B']*0.25+df['C']*0.25
elif (pd.notnull(df['A']) == True & pd.notnull(df['B']) == False & pd.notnull(df['C']) == False):
    return df['A']*1
elif (pd.notnull(df['A']) == True & pd.notnull(df['B']) == True & pd.notnull(df['C']) == False):
    return df['A']*0.75+df['B']*0.25
elif (pd.notnull(df['A']) == True & pd.notnull(df['B']) == False & pd.notnull(df['C']) == True):
    return df['A']*0.75+df['C']*0.25
elif (pd.notnull(df['A']) == False & pd.notnull(df['B']) == True & pd.notnull(df['C']) == True):
    return df['B']*0.5+df['C']*0.5
elif (pd.notnull(df['A']) == False & pd.notnull(df['B']) == True & pd.notnull(df['C']) == False):
    return df['B']*1
else: 
    return df['C']*1

fd['F'] =fd.apply(scaler, axis = 'columns')
fd.head(10)

     A   B   C   F
0   1.0 NaN NaN NaN
1   NaN NaN 2.0 NaN
2   2.0 1.0 3.0 2.0
3   NaN 1.0 6.0 6.0
4   5.0 1.0 NaN NaN
5   6.0 NaN 5.0 5.0
6   NaN 7.0 NaN 7.0

So, I get a df where correctly weighted and summed only columns with all three non Nan values. In case there is at least one Nan in one of the columns I get either Nan or incorrect resulting value in column F.

To overcome this issue I've substited all the Nan values in the original df with some float which is out of range for all the columns and then presented above code logic works perfect. My questions are:

1) why does it happen (all the Nan values flipping around the results though columns containing those values are not participating directly in the formulas which are retuned)?

2) The way I've overcome the problem I find a bit slopy. Is there maybe more elegant solution?

Upvotes: 1

Views: 75

Answers (1)

jpp
jpp

Reputation: 164823

You are misunderstanding how pd.DataFrame.apply works. Along axis=1, each row is passed to the function, not the entire dataframe. It's useful to name your function argument accordingly.

You are working with scalars not series within your function, and should use regular and instead of &. Also note that pd.isnull exists as well as pd.notnull. So you can rewrite as follows:

def scaler(row):
    "Scaling and summing"
    if pd.notnull(row['A']) and pd.notnull(row['B']) and pd.notnull(row['C']):
        return row['A']*0.5 + row['B']*0.25 + row['C']*0.25
    elif pd.notnull(row['A']) and pd.isnull(row['B']) and pd.isnull(row['C']):
        return row['A']
    elif pd.notnull(row['A']) and pd.notnull(row['B']) and pd.isnull(row['C']):
        return row['A']*0.75 + row['B']*0.25
    elif pd.notnull(row['A']) and pd.isnull(row['B']) and pd.notnull(row['C']):
        return row['A']*0.75 + row['C']*0.25
    elif pd.isnull(row['A']) and pd.notnull(row['B']) and pd.notnull(row['C']):
        return row['B']*0.5 + row['C']*0.5
    elif pd.isnull(row['A']) and pd.notnull(row['B']) and pd.isnull(row['C']):
        return row['B']
    else: 
        return row['C']

df['F'] = df.apply(scaler, axis=1)

But this is inefficient for a large number of rows. More efficient, and readable, is a solution using np.select. These use vectorised operations only. Notice we compute only once the checking for whether values are null within each series.

a_null = df['A'].isnull()
b_null = df['B'].isnull()
c_null = df['C'].isnull()

conds = [~a_null & b_null & c_null,
         a_null & ~b_null & c_null,
         a_null & b_null & ~c_null,
         ~a_null & ~b_null & c_null,
         ~a_null & b_null & ~c_null,
         a_null & ~b_null & ~c_null,
         ~a_null & ~b_null & ~c_null]

choices = [df['A'], df['B'], df['C'],
           0.75 * df['A'] + 0.25 * df['B'],
           0.75 * df['A'] + 0.25 * df['C'],
           0.5 * df['B'] + 0.5 * df['C'],
           0.5 * df['A'] + 0.25 * df['B'] + 0.25 * df['C']]

df['F'] = np.select(conds, choices)

Result:

     A    B    C     F
0  1.0  NaN  NaN  1.00
1  NaN  NaN  2.0  2.00
2  2.0  1.0  3.0  2.00
3  NaN  1.0  6.0  3.50
4  5.0  1.0  NaN  4.00
5  6.0  NaN  5.0  5.75
6  NaN  7.0  NaN  7.00

Upvotes: 1

Related Questions