Reputation: 1667
Scenario: I have a dataframe with some nan scattered around. It has multiple columns, the ones of interest are "bid" and "ask"
What I want to do: I want to remove all rows where the bid column value is nan AND the ask column value is nan.
Question: What is the best way to do it?
What I already tried:
ab_df = ab_df[ab_df.bid != 'nan' and ab_df.ask != 'nan']
ab_df = ab_df[ab_df.bid.empty and ab_df.ask.empty]
ab_df = ab_df[ab_df.bid.notnull and ab_df.ask.notnull]
But none of them work.
Upvotes: 0
Views: 2760
Reputation: 2689
ab_df = ab_df.loc[~ab_df.bid.isnull() | ~ab_df.ask.isnull()]
all this time I've been usign that because i convinced myself that .notnull()
didn't exist. TIL.
ab_df = ab_df.loc[ab_df.bid.notnull() | ab_df.ask.notnull()]
The key is &
rather than and
and |
rather than or
I made a mistake earlier using &
- this is wrong because you want either bid isn't null OR ask isn't null, using and would give you only the rows where both are not null.
I think you can ab_df.dropna()
as well, but i'll have to look it up
EDIT
oddly df.dropna()
doesn't seem to support dropping based on NAs in a specific column. I would have thought it did.
based on the other answer I now see it does. It's friday afternoon, ok?
Upvotes: 2
Reputation: 214927
You need vectorized logical operators &
or |
(and and or from python are to compare scalars not for pandas Series), to check nan values, you can use isnull
and notnull
:
To remove all rows where the bid column value is nan AND the ask column value is nan, keep the opposite:
ab_df[ab_df.bid.notnull() | ab_df.ask.notnull()]
Example:
df = pd.DataFrame({
"bid": [pd.np.nan, 1, 2, pd.np.nan],
"ask": [pd.np.nan, pd.np.nan, 2, 1]
})
df[df.bid.notnull() | df.ask.notnull()]
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
If you need both columns to be non missing:
df[df.bid.notnull() & df.ask.notnull()]
# ask bid
#2 2.0 2.0
Another option using dropna
by setting the thresh parameter:
df.dropna(subset=['ask', 'bid'], thresh=1)
# ask bid
#1 NaN 1.0
#2 2.0 2.0
#3 1.0 NaN
df.dropna(subset=['ask', 'bid'], thresh=2)
# ask bid
#2 2.0 2.0
Upvotes: 3