Reputation: 55
I need to replace values in a DataFrame that exceed a certain threshold with NaN.
For example, suppose I need to replace all values higher than 100 with NaN:
import pandas as pd

df = pd.DataFrame({'a': [1, 250, 480],
                   'b': [60, 51, 101],
                   'c': [15, 689, 1]})
would become:
({'a': [1, NaN, NaN],
  'b': [60, 51, NaN],
  'c': [15, NaN, 1]})
What would be the best way to do that?
Upvotes: 4
Views: 11803
Reputation: 862611
Use any of the following (they all produce the same result):
df = df.mask(df > 100)        # replace values where the condition is True with NaN
df = df.where(df <= 100)      # keep values where the condition is True, NaN elsewhere
df = pd.DataFrame(np.where(df > 100, np.nan, df), index=df.index, columns=df.columns)
print(df)
     a     b     c
0  1.0  60.0  15.0
1  NaN  51.0   NaN
2  NaN   NaN   1.0
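Note that the result columns are float64 even though the input was integer, because NaN is a float and integer columns cannot hold it. If mutating the original frame is acceptable, boolean-indexing assignment is another common pandas idiom; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 250, 480],
                   'b': [60, 51, 101],
                   'c': [15, 689, 1]})

# Assigning through a boolean mask modifies df in place.
df[df > 100] = np.nan

# NaN forces an upcast, so the integer columns become float64.
print(df.dtypes)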
Timing comparison (results depend on the data):
df = pd.concat([df] * 10000, ignore_index=True)
In [104]: %timeit pd.DataFrame(np.where(df > 100, np.nan, df), index=df.index, columns=df.columns)
The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 683 µs per loop
In [105]: %timeit df[:] = np.where(df.values <= 100, df.values, np.nan)
__main__:257: RuntimeWarning: invalid value encountered in less_equal
The slowest run took 17.24 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 957 µs per loop
In [106]: %timeit df.mask(df > 100)
1000 loops, best of 3: 1.56 ms per loop
In [107]: %timeit df.where(df <= 100)
The slowest run took 8.01 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.84 ms per loop
In [108]: %timeit df[df<100]
The slowest run took 5.57 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.89 ms per loop
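To reproduce a comparison like this on your own data and pandas version, a small script with the timeit module is enough (the numbers above will not match exactly, and the relative order can change between pandas releases). The in-place variant from In [105] is left out here because it mutates df between runs; that mutation is also the likely cause of its RuntimeWarning, since after the first run df already contains NaN and comparing NaN with <= triggers the invalid-value warning.
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 250, 480],
                   'b': [60, 51, 101],
                   'c': [15, 689, 1]})
df = pd.concat([df] * 10000, ignore_index=True)

# Each statement returns a new frame, so df itself is never modified between runs.
candidates = {
    'np.where + constructor': 'pd.DataFrame(np.where(df > 100, np.nan, df), index=df.index, columns=df.columns)',
    'mask':                   'df.mask(df > 100)',
    'where':                  'df.where(df <= 100)',
    'boolean indexing':       'df[df < 100]',
}

for label, stmt in candidates.items():
    total = timeit.timeit(stmt, number=100, globals={'pd': pd, 'np': np, 'df': df})
    print(f'{label:<24}{total / 100 * 1000:.3f} ms per loop')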
Upvotes: 4
Reputation: 402483
np.where with an in-place update:
# Build the replacement array with NumPy and write it back into the existing frame.
df[:] = np.where(df.values <= 100, df.values, np.nan)
df
     a     b     c
0  1.0  60.0  15.0
1  NaN  51.0   NaN
2  NaN   NaN   1.0
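As with the other approaches, NaN forces the integer columns to float64. If integer columns must be preserved, one option (assuming pandas 1.0+ with the nullable 'Int64' dtype) is to mask first and then cast, so the missing cells become pd.NA; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'a': [1, 250, 480],
                   'b': [60, 51, 101],
                   'c': [15, 689, 1]})

# mask() yields float64 with NaN; casting to the nullable 'Int64' dtype turns
# the NaNs into pd.NA while restoring integer values.
out = df.mask(df > 100).astype('Int64')
print(out)
print(out.dtypes)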
Upvotes: 3