Reputation: 113

Masking a pandas DataFrame with a numpy array vs DataFrame

I want to use a 2D boolean mask to selectively alter some cells in a pandas DataFrame. I noticed that I cannot use a numpy array (successfully) as the mask, but I can use a DataFrame. More frustrating, however, is that I don't get an error with the numpy approach.

For example,

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4],
                   'B': [10, 20, 30, 40]})

mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

mask_pd = pd.DataFrame(mask_np, columns=['A', 'B'])

I would expect either mask to return the values from df wherever the mask is True. Instead, df[mask_np] produces

   A   B
0  1  10
0  1  10
2  3  30
3  4  40

which is not what I expect, and I cannot explain it. On the other hand, df[mask_pd] produces

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0

which is what I expect and want.

Why can't I use the numpy mask? My internet search turned up nothing relevant. Any explanation behind this difference would be greatly appreciated!

[pandas version 0.20.3; Python 3.6.3]

Upvotes: 6

Answers (2)

Andrey Portnoy

Reputation: 1509

Write down the row index of each True value in your mask_np: row 0, row 0, row 2, row 3. Select the rows with those indices from df and concatenate them. That's how df[mask_np] is produced.

This is probably a Pandas bug, since the source code assumes that the array used for indexing is 1-dimensional.

Looking at the source code (Pandas 0.23.4),

df[mask_np]

is equivalent to

df._getitem_bool_array(mask_np)

is equivalent to

indexer = mask_np.nonzero()[0]
df._take(indexer, axis=0)

with the following evaluation:

>>> mask_np.nonzero()
(array([0, 0, 2, 3]), array([0, 1, 0, 1]))

This tuple of arrays holds the indices of the nonzero elements along each dimension of the array. In this case, the elements of the first array in the tuple (eventually used in df._take) are the row indices of the True values in mask_np.

The first array is used to take along the index, so you get rows 0, 0, 2, 3 of df in return.
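
The same row selection can be reproduced with public API only. This is a minimal sketch that assumes the df and mask_np from the question; it uses DataFrame.take in place of the private df._take, and it illustrates the mechanism rather than what every Pandas version does for df[mask_np].

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# nonzero() returns one index array per dimension of the mask.
row_idx, col_idx = mask_np.nonzero()

# Only the row indices are used, so row 0 is taken twice
# (once for each True in that row of the mask).
rows = df.take(row_idx, axis=0)
print(rows)
```

The column indices in col_idx are simply discarded, which is why the 2D shape of the mask has no effect on which columns come back.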

Upvotes: 1

jpp

Reputation: 164693

The source code suggests why. The __getitem__ method, for which [] is syntactic sugar, checks specifically for indexing via a dataframe:

elif isinstance(key, DataFrame):
    return self._getitem_frame(key)

The _getitem_frame method it calls raises an error unless the dataframe holds only Boolean values, and otherwise returns the result of pd.DataFrame.where:

def _getitem_frame(self, key):
    if key.values.size and not is_bool_dtype(key.values):
        raise ValueError('Must pass DataFrame with boolean values only')
    return self.where(key)

The route taken for NumPy arrays, _getitem_array, is different and more convoluted. The code treats NumPy and Pandas inputs differently, rather than handling the same Boolean data consistently regardless of container.

Regular Boolean indexing with a Pandas dataframe is usually applied along an axis, i.e. by rows / axis 0 via df.loc[mask, :] or columns / axis 1 via df.loc[:, mask].
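
As a rough illustration of axis-wise Boolean indexing, assuming the df from the question and two hypothetical 1D masks (row_mask and col_mask are made up for this example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})

# Hypothetical 1D masks: one entry per row, one entry per column.
row_mask = np.array([True, False, True, False])
col_mask = np.array([False, True])

# Select whole rows (axis 0) where row_mask is True.
rows = df.loc[row_mask, :]

# Select whole columns (axis 1) where col_mask is True.
cols = df.loc[:, col_mask]

print(rows)
print(cols)
```

In both cases the mask picks entire rows or columns; neither form selects individual cells the way a 2D mask passed to where does.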

Note you can, and probably should, access pd.DataFrame.where directly for clarity:

res = df.where(mask_np)

print(res)

     A     B
0  1.0  10.0
1  NaN   NaN
2  3.0   NaN
3  NaN  40.0
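
Since the question is about altering cells selectively, here is a small sketch, again assuming the df and mask_np from the question, that contrasts where with its counterpart DataFrame.mask. The replacement value -1 is arbitrary and chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]})
mask_np = np.array([[True, True],
                    [False, False],
                    [True, False],
                    [False, True]])

# where keeps cells whose mask entry is True; all others become NaN.
kept = df.where(mask_np)

# mask does the opposite: cells whose mask entry is True are replaced.
replaced = df.mask(mask_np, other=-1)

print(kept)
print(replaced)
```

Both methods accept a NumPy array of the same shape as df, so no conversion to a DataFrame mask is needed.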

Upvotes: 3
