armara

Reputation: 557

Replace NaN inside masked dataframe

import pandas as pd

some_dict = {'a': [1,2,3,4], 'b': [5,6,7,8],}
df = pd.DataFrame(some_dict)
mask1 = pd.Series([False, True, False, True])
df['c'] = df.loc[mask1, 'a']

This creates a new column c with NaN values where mask1 is False, and the values from column a where mask1 is True. Now I add b to c, under a different condition:

mask2 = pd.Series([True, False, False, False])
df['c'] += df.loc[mask2, 'b']

However, this sets ALL values to NaN. I'm guessing this is because NaN + 0 and 0 + NaN are both equal to NaN. I tried solving this by writing:

df['c'] = df.loc[mask1, 'a'].fillna(0)
df['c'] = df.loc[mask2, 'b'].fillna(0)

Why is replacing NaN with 0 through fillna() not working?

Upvotes: 1

Views: 1667

Answers (2)

jezrael

Reputation: 862771

If you check what the selection returns, it contains only the matched rows. Missing values are added later, and only for the rows that are not matched (the False values in the mask):

print (df.loc[mask1, 'a'])
1    2
3    4
Name: a, dtype: int64

So if you try to replace NaN here, there are no missing values yet, and fillna has nothing to do:

print (df.loc[mask1, 'a'].fillna(0))
1    2
3    4
Name: a, dtype: int64
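
To make this concrete, here is a minimal sketch, assuming the same df and mask1 as in the question. The names selected, has_nan_before and filled are only illustrative. It shows that the NaN values appear once the selection is reindexed to the full frame index, so fillna only has an effect after that step:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
mask1 = pd.Series([False, True, False, True])

selected = df.loc[mask1, 'a']
# The selection holds only rows 1 and 3, so it contains no NaN to fill.
has_nan_before = bool(selected.isna().any())
print(has_nan_before)  # False

# Reindexing to the full frame index is what introduces the NaN values;
# fillna only has something to replace after that step.
filled = selected.reindex(df.index).fillna(0)
print(filled.tolist())  # [0.0, 2.0, 0.0, 4.0]
```

In other words, calling fillna(0) on the column after the assignment, rather than on the selection before it, should give the zeros the question was after.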

When you assign the selection to a column, the unmatched rows have to be created too, and because pandas has no idea what they should contain, it fills them with NaN:

df['c'] = df.loc[mask1, 'a']
print (df)
   a  b    c
0  1  5  NaN <- False
1  2  6  2.0
2  3  7  NaN <- False
3  4  8  4.0
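
The same index alignment also explains why the += in the question turns every row into NaN. A small sketch, assuming the question's df, mask1 and mask2 (addend and total are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
mask1 = pd.Series([False, True, False, True])
mask2 = pd.Series([True, False, False, False])

df['c'] = df.loc[mask1, 'a']   # c is NaN at rows 0 and 2
addend = df.loc[mask2, 'b']    # holds only row 0

# Addition aligns both operands on the union of their indexes; a label
# missing from either side, or NaN on either side, yields NaN.
total = df['c'] + addend
print(total.isna().all())  # True
```

Row 0 is NaN in c, and rows 1 to 3 do not exist in addend, so no row has a value on both sides.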

So to get 0 instead of NaN, use numpy.where: where the mask is True it takes the value from a, and where it is False it takes 0:

df['c'] = np.where(mask1, df['a'], 0)

print (df)
   a  b  c
0  1  5  0
1  2  6  2
2  3  7  0
3  4  8  4

Another pandas alternative is Series.where:

df['c'] = df['a'].where(mask1, 0)
print (df)
   a  b  c
0  1  5  0
1  2  6  2
2  3  7  0
3  4  8  4

All together:

import numpy as np
import pandas as pd

some_dict = {'a': [1,2,3,4], 'b': [5,6,7,8],}
df = pd.DataFrame(some_dict)
mask1 = pd.Series([False, True, False, True])
df['c'] = np.where(mask1, df['a'], 0)

mask2 = pd.Series([True, False, False, False])
df['c'] += np.where(mask2, df['b'], 0)

print (df)
   a  b  c
0  1  5  5
1  2  6  2
2  3  7  0
3  4  8  4

Another idea is to use Series.add with fill_value=0:

some_dict = {'a': [1,2,3,4], 'b': [5,6,7,8],}
df = pd.DataFrame(some_dict)
mask1 = pd.Series([False, True, False, True])
df['c'] = df.loc[mask1, 'a']
print (df)
   a  b    c
0  1  5  NaN
1  2  6  2.0
2  3  7  NaN
3  4  8  4.0

mask2 = pd.Series([True, False, False, False])
df['c'] = df['c'].add(df.loc[mask2, 'b'], fill_value=0)
    
print (df)
   a  b    c
0  1  5  5.0
1  2  6  2.0
2  3  7  NaN
3  4  8  4.0
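
Note that row 2 is still NaN in the output above, because it is missing on both sides of the addition. A possible follow-up, sketched here under the assumption that 0 is the wanted default and an integer column is acceptable:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
mask1 = pd.Series([False, True, False, True])
mask2 = pd.Series([True, False, False, False])

df['c'] = df.loc[mask1, 'a']
df['c'] = df['c'].add(df.loc[mask2, 'b'], fill_value=0)

# fill_value only helps when at least one operand has a value; row 2 is
# missing on both sides, so it is still NaN and needs a separate fillna.
df['c'] = df['c'].fillna(0).astype('int64')
print(df['c'].tolist())  # [5, 2, 0, 4]
```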

Upvotes: 2

Pierre D

Reputation: 26221

Try to always match the lvalue's index with the rvalue's one. Otherwise, pandas does its best to reindex the two before doing your operation.

Also, always explicitly set a new column to some default value of your choice, e.g. 0, before doing partial (sliced or masked) assignments (it will also give you the correct dtype).

So, concretely:

import pandas as pd

some_dict = {'a': [1,2,3,4], 'b': [5,6,7,8],}
df = pd.DataFrame(some_dict)
mask1 = pd.Series([False, True, False, True])
df['c'] = 0
df.loc[mask1, 'c'] = df.loc[mask1, 'a']

mask2 = pd.Series([True, False, False, False])
df.loc[mask2, 'c'] += df.loc[mask2, 'b']

print(df)
   a  b  c
0  1  5  5
1  2  6  2
2  3  7  0
3  4  8  4
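
If there are more than two conditions, the pattern above could be wrapped in a loop. The helper below is a hypothetical sketch (accumulate_masked is not a pandas function, and its parameters are my own naming); it assumes every mask is a boolean Series sharing the frame's index:

```python
import pandas as pd

def accumulate_masked(df, target, steps, default=0):
    """Hypothetical helper: start `target` at `default`, then add each
    source column on the rows where its mask is True."""
    out = df.copy()
    out[target] = default
    for mask, source in steps:
        # Both sides use the same mask, so their indexes always match.
        out.loc[mask, target] += out.loc[mask, source]
    return out

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [5, 6, 7, 8]})
mask1 = pd.Series([False, True, False, True])
mask2 = pd.Series([True, False, False, False])

result = accumulate_masked(df, 'c', [(mask1, 'a'), (mask2, 'b')])
print(result['c'].tolist())  # [5, 2, 0, 4]
```

Working on a copy keeps the original frame untouched, which may or may not be what you want for large data.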

Upvotes: 1
