DGMS89
DGMS89

Reputation: 1667

Deleting a row in pandas dataframe based on condition

Scenario: I have a dataframe with some nan scattered around. It has multiple columns, the ones of interest are "bid" and "ask"

What I want to do: I want to remove all rows where the bid column value is nan AND the ask column value is nan.

Question: What is the best way to do it?

What I already tried:

ab_df = ab_df[ab_df.bid != 'nan' and ab_df.ask != 'nan']

ab_df = ab_df[ab_df.bid.empty and ab_df.ask.empty] 

ab_df = ab_df[ab_df.bid.notnull and ab_df.ask.notnull]

But none of them work.

Upvotes: 0

Views: 2760

Answers (2)

Stael
Stael

Reputation: 2689

ab_df = ab_df.loc[~ab_df.bid.isnull() | ~ab_df.ask.isnull()]

all this time I've been usign that because i convinced myself that .notnull() didn't exist. TIL.

ab_df = ab_df.loc[ab_df.bid.notnull() | ab_df.ask.notnull()]

The key is & rather than and and | rather than or

I made a mistake earlier using & - this is wrong because you want either bid isn't null OR ask isn't null, using and would give you only the rows where both are not null.

I think you can ab_df.dropna() as well, but i'll have to look it up

EDIT

oddly df.dropna() doesn't seem to support dropping based on NAs in a specific column. I would have thought it did.

based on the other answer I now see it does. It's friday afternoon, ok?

Upvotes: 2

akuiper
akuiper

Reputation: 214927

You need vectorized logical operators & or | (and and or from python are to compare scalars not for pandas Series), to check nan values, you can use isnull and notnull:

To remove all rows where the bid column value is nan AND the ask column value is nan, keep the opposite:

ab_df[ab_df.bid.notnull() | ab_df.ask.notnull()]

Example:

df = pd.DataFrame({
        "bid": [pd.np.nan, 1, 2, pd.np.nan],
        "ask": [pd.np.nan, pd.np.nan, 2, 1]
    })

df[df.bid.notnull() | df.ask.notnull()]

#   ask bid
#1  NaN 1.0
#2  2.0 2.0
#3  1.0 NaN

If you need both columns to be non missing:

df[df.bid.notnull() & df.ask.notnull()]

#   ask bid
#2  2.0 2.0

Another option using dropna by setting the thresh parameter:

df.dropna(subset=['ask', 'bid'], thresh=1)

#   ask bid
#1  NaN 1.0
#2  2.0 2.0
#3  1.0 NaN

df.dropna(subset=['ask', 'bid'], thresh=2)

#   ask bid
#2  2.0 2.0

Upvotes: 3

Related Questions