Reputation: 113
I want to use a 2D boolean mask to selectively alter some cells in a pandas DataFrame. I noticed that I cannot use a numpy array (successfully) as the mask, but I can use a DataFrame. More frustrating, however, is that I don't get an error with the numpy approach.
For example,
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])
mask_pd = pd.DataFrame(mask_np, columns=['A', 'B'])
I would think either mask would return the values from df wherever the mask was True. But instead, df[mask_np] produces
   A   B
0  1  10
0  1  10
2  3  30
3  4  40
which is not what I expect, and which I cannot explain. On the other hand, df[mask_pd] produces
     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
which is what I expect and want.
Why can't I use the numpy mask? My internet search turned up nothing relevant. Any explanation behind this difference would be greatly appreciated!
[pandas version 0.20.3; Python 3.6.3]
Upvotes: 6
Views: 6185
Reputation: 1509
Write down the row indices of the True's in your mask_np: row 0, row 0, row 2, row 3. Select the rows with the same indices in df and concatenate them. That's how df[mask_np] is produced.
This is probably a Pandas bug, since it's assumed in the source code that the array used for indexing is 1-dimensional.
Looking at the source code (Pandas 0.23.4), df[mask_np] is equivalent to
df._getitem_bool_array(mask_np)
which in turn is equivalent to
indexer = mask_np.nonzero()[0]
df._take(indexer, axis=0)
with the following evaluation:
>>> mask_np.nonzero()
(array([0, 0, 2, 3]), array([0, 1, 0, 1]))
This tuple of arrays holds the indices of the nonzero elements along each dimension of the array. In this case, the elements of the first array in the tuple (eventually used in df._take) are the row indices of the True's in mask_np.
The first array is used to take along the index, so you get rows 0, 0, 2, 3 of df in return.
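The steps above can be reproduced using public methods only. This is a minimal sketch, assuming the df and mask_np from the question. It uses the public df.take in place of the private df._take, and it deliberately does not call df[mask_np] itself, since how that expression behaves may depend on the pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# nonzero() returns one index array per dimension of the mask;
# the first array holds the row position of every True cell.
row_idx, col_idx = mask_np.nonzero()

# Taking those positions along axis 0 repeats row 0 (two True cells)
# and skips row 1 (no True cells); the column positions are ignored.
result = df.take(row_idx, axis=0)
print(result)
```

The column positions in col_idx never enter the selection, which is why the 2D structure of the mask is lost.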
Upvotes: 1
Reputation: 164693
The source code suggests why. The __getitem__ method, for which [] is syntactic sugar, checks specifically for indexing via a dataframe:
elif isinstance(key, DataFrame):
    return self._getitem_frame(key)
The _getitem_frame method that it calls then returns the result of pd.DataFrame.where if the dataframe is of Boolean type:
def _getitem_frame(self, key):
    if key.values.size and not is_bool_dtype(key.values):
        raise ValueError('Must pass DataFrame with boolean values only')
    return self.where(key)
The route taken for NumPy arrays, _getitem_array, is different and more convoluted. For some reason, the code is designed to treat NumPy / Pandas inputs differently, rather than to ensure consistency for the same data types.
Regular Boolean indexing with a Pandas dataframe is usually applied along an axis, i.e. by rows / axis 0 via df.loc[mask, :] or columns / axis 1 via df.loc[:, mask].
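To illustrate indexing along a single axis, here is a small sketch using the df from the question. The names row_mask and col_mask are chosen for this example only; each is a 1D mask with one Boolean per row or per column.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

# 1D mask over rows (axis 0): one Boolean per row.
row_mask = df['A'] > 2
rows = df.loc[row_mask, :]

# 1D mask over columns (axis 1): one Boolean per column.
col_mask = [True, False]
cols = df.loc[:, col_mask]

print(rows)
print(cols)
```

In both cases whole rows or whole columns are kept or dropped, which differs from the cell-by-cell selection a 2D mask describes.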
Note you can, and probably should, access pd.DataFrame.where directly for clarity:
res = df.where(mask_np)
print(res)
     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
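Since the question is about altering cells, the following sketch may help. It assumes the df and mask_np from the question; the replacement value 0 and the variable names kept, same and zeroed are chosen for illustration only. It shows two ways to apply a NumPy mask cell by cell: wrapping it in a DataFrame that shares the labels of df, or passing it directly to where. It also uses mask, the inverse of where, to overwrite the selected cells.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# Wrap the array in a DataFrame with df's labels so that df[...]
# follows the DataFrame branch of __getitem__.
mask_pd = pd.DataFrame(mask_np, index=df.index, columns=df.columns)
kept = df[mask_pd]

# where() accepts the raw NumPy array as long as its shape matches df.
same = df.where(mask_np)

# mask() replaces the cells where the condition is True
# (here with the illustrative value 0) and keeps the rest.
zeroed = df.mask(mask_np, 0)

print(kept)
print(zeroed)
```

Reusing df.index and df.columns when building mask_pd keeps the mask aligned with df even when df does not have a default integer index.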
Upvotes: 3