alex3465
alex3465

Reputation: 419

Filter rows with list of strings and NaN altogether in pandas

I am trying to filter the rows having strings and NaN values. The following code does not work

   list=['ok','good','fine']
    
   df_new = df[(df['b'].str.contains('|'.join(list))) | (df.b.isnull())]

Example

df = pd.DataFrame({'a':[1,2,np.nan, 4, 3 ], 'b':['ok',1,np.nan, np.nan,'fine' ]})
df

        a   b
   0    1.0 ok
   1    2.0 1
   2    NaN NaN
   3    4.0 NaN
   4    3.0 fine

df[(df['b'].str.contains('|'.join(list))) | (df.b.isnull())]

        a   b
   0    1.0 ok
   4    3.0 fine

Desired Output

   0    1.0 ok
   2    NaN NaN
   3    4.0 NaN
   4    3.0 fine

I know there are many way to achieve this but wondering why this code is not working

Upvotes: 0

Views: 73

Answers (1)

9769953
9769953

Reputation: 12251

Your problem is that df['b'].str.contains('|'.join(l)) produces NaNs, since a comparison to NaN here will yield a NaN in the result. That is,

>>> df['b'].str.contains('|'.join(l))
0    True
1     NaN
2     NaN
3     NaN
4    True

If you logical-or(*) that with the df.b.isnull() condition, the NaNs will be interpreted as False (or probably as NaN even, but be equivalent to False):

>>> df['b'].str.contains('|'.join(l)) | df.b.isnull()
0     True
1    False
2    False
3    False
4     True

Hence you lose a few rows.

If you reverse the order for the logical-or, shortcutting for comparisons will kick in: if sides are or-ed, and the left-hand side is True, there is no need to look at the right-hand side (whether it be False, True or NaN), and thus your rows remain:

>>> df.b.isnull() | df['b'].str.contains('|'.join(l))
0     True
1    False
2     True
3     True
4     True

Hence you'll get your result when you logical-or the sides in reverse order:

>>> df[df.b.isnull() | df.b.str.contains('|'.join(l))]
     a     b
0  1.0    ok
2  NaN   NaN
3  4.0   NaN
4  3.0  fine

(You can remove some parentheses, as well as replace the ['b'] part.)


I don't think this is something to rely on, though. You would be better off, for example, by replacing the NaNs with a more relevant value (perhaps an empty string), so that the boolean combination works both ways. Because at some point, a trick like the one above, will bite you.

df.b = df.b.fillna("")

replaces the NaNs with an empty string.


(*) "logical-or" is not a verb that I know of, but for this context, I made it a verb. As some say: verbing weirds language.

Upvotes: 2

Related Questions