Alex

Reputation: 19873

Boolean selection and masked assignment

Newbie pandas question regarding boolean selection in a DataFrame. Say I have the following, in which I would like to take all entries whose absolute value is > 1 and set them to 3, keeping the original sign:

import numpy as np
import pandas as pd

s = pd.DataFrame(data=np.random.randn(10, 4), index=np.arange(10),
              columns=["a", "b", "c", "d"])

s[np.abs(s) > 1] = np.sign(s) * 3

The RHS has a different shape than the LHS, so how come this works fine and I don't have to do

s[np.abs(s) > 1] = np.sign(s[np.abs(s) > 1]) * 3

My understanding is that the LHS of both of these expressions returns a view onto the elements where the expression in brackets evaluates to True. However, examining the LHS of the first statement on its own shows that it returns NaN for elements where the selection condition is False. What am I missing?

Upvotes: 2

Views: 776

Answers (1)

John Zwinck

Reputation: 249592

What you are missing is that an indexing statement in Python can have different meanings based on whether it is on the RHS or LHS of an assignment. In your case:

s[np.abs(s) > 1] = np.sign(s) * 3

This results in a call to pd.DataFrame.__setitem__(s, np.abs(s) > 1, np.sign(s) * 3). And since np.abs(s) > 1 returns True only in certain cells, Pandas implements __setitem__() to only modify those cells. This is just a useful convention; nothing in the Python language enforces it per se.
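To see that the two bracket forms really are separate method calls, here is a minimal sketch using a toy class. Recorder is a hypothetical name made up for this illustration and has nothing to do with pandas; it only logs which special method Python invokes.

```python
class Recorder:
    """Hypothetical toy class that logs which indexing hook Python calls."""

    def __init__(self):
        self.calls = []

    def __getitem__(self, key):
        # Invoked for obj[key] when it is read (e.g. on the RHS).
        self.calls.append(("getitem", key))
        return None

    def __setitem__(self, key, value):
        # Invoked for obj[key] = value; __getitem__ is NOT called first.
        self.calls.append(("setitem", key, value))


r = Recorder()
r["k"] = 1   # assignment form
_ = r["k"]   # read form
print(r.calls)
```

Each class is free to decide what the key means in each method, which is why pandas can treat a boolean mask differently when reading and when assigning.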

On the other hand if you say:

print(s[np.abs(s) > 1])

This results in a call to pd.DataFrame.__getitem__(s, np.abs(s) > 1). And Pandas implements this by returning a DataFrame with the same shape as s but with the "missing" values (the cells where the condition is False) filled with NaN.
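A small sketch of the read case, using a fixed 3x2 frame instead of the random data from the question so the result is easy to inspect. The values and the names mask and selected are just illustrative choices.

```python
import numpy as np
import pandas as pd

# Small fixed frame (illustrative values, not the question's random data).
s = pd.DataFrame({"a": [0.5, -2.0, 1.5], "b": [3.0, 0.1, -0.2]})

mask = np.abs(s) > 1   # boolean DataFrame with the same shape as s
selected = s[mask]     # read form: goes through __getitem__

# Same shape as s; cells where the mask is False come back as NaN.
print(selected)
print(selected.shape == s.shape)
```

As far as I can tell this matches what s.where(mask) returns for a boolean DataFrame key.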

So when you do the assignment, don't imagine that Pandas is creating a DataFrame with NaN values where the condition is False, then assigning to that DataFrame. That's not what happens. It just copies the values from the RHS to the LHS wherever the condition is True.
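Here is a sketch of the assignment case on a small fixed frame (illustrative values, not the question's random data). It runs both forms from the question and compares them against an explicit DataFrame.mask call, which I am using only as a reference for "replace where True, keep otherwise".

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"a": [0.5, -2.0, 1.5], "b": [3.0, 0.1, -0.2]})
mask = np.abs(original) > 1

# Form 1: full-shape RHS; only cells where mask is True are overwritten.
s1 = original.copy()
s1[mask] = np.sign(s1) * 3

# Form 2: RHS built from the masked read, so it holds NaN where mask is False.
# Those NaN cells are never copied, because the mask is False there.
s2 = original.copy()
s2[mask] = np.sign(s2[mask]) * 3

# Reference: replace where mask is True, keep the original value elsewhere.
expected = original.mask(mask, np.sign(original) * 3)

print(s1)
print(s1.equals(s2), s1.equals(expected))
```

Both forms end up with the same frame, so the shorter first form is enough.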

Upvotes: 4
