Pandas DataFrame filtering

Question

Let's say I have a DataFrame with four columns, each of which has a threshold value against which I'd like to compare the DataFrame's values.

For each value, I simply want the smaller of the DataFrame's value and its column's threshold.

For example:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 4), columns=list('ABCD'))

>>> df.head()
          A         B         C         D
0 -2.060410 -1.390896 -0.595792 -0.374427
1  0.660580  0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210  1.711597 -0.063274
4  1.217197  0.202063 -1.407561  0.940371

thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})

This solution works (row 4 of column A and row 3 of column C were capped at their thresholds), but there must be an easier way:

df_filtered = df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)

>>> df_filtered.head()
          A         B         C         D
0 -2.060410 -1.390896 -0.595792 -0.374427
1  0.660580  0.726795 -1.326431 -1.488186
2 -0.955792 -1.852701 -0.895178 -1.353669
3 -1.002576 -0.321210  1.200000 -0.063274
4  1.000000  0.202063 -1.407561  0.940371

Ideally, I'd like to use .loc to filter in place, but I haven't managed to figure it out. I'm using Pandas 0.14.1 (and can't upgrade).
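For the in-place variant asked about here, one possible sketch is a per-column .loc assignment. The fixed seed and the name original are illustrative assumptions, and this has not been checked against pandas 0.14.1 specifically:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(rng.randn(100, 4), columns=list('ABCD'))
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})
original = df.copy()  # kept only so the result can be compared afterwards

# Cap each column in place: rows whose value exceeds the column's
# threshold are overwritten with that threshold.
for col in thresholds.index:
    cap = thresholds[col]
    df.loc[df[col] > cap, col] = cap
```

This modifies df directly instead of building a new DataFrame, at the cost of one boolean mask per column.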

RESPONSE: Below are timings of my initial proposal against the suggested alternatives:

%%timeit
df.lt(thresholds).multiply(df) + df.gt(thresholds).multiply(thresholds)
1000 loops, best of 3: 990 µs per loop

%%timeit
np.minimum(df, thresholds)  # <--- Simple, fast, and returns DataFrame!
10000 loops, best of 3: 110 µs per loop

%%timeit
# Note: inplace=True acts on a temporary here, so the filled result is discarded.
df[df < thresholds].fillna(thresholds, inplace=True)
1000 loops, best of 3: 1.36 ms per loop
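
A caveat on the third timing: fillna(..., inplace=True) is applied to the temporary produced by df[df < thresholds], so df itself is left unchanged. A hedged sketch of the same idea that keeps its result simply drops inplace; the two-row frame and the name result are illustrative assumptions, not data from the post:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0.5, 1.5], 'B': [2.0, -1.0]})
thresholds = pd.Series({'A': 1.0, 'B': 1.1})

# Values at or above a column's threshold are masked to NaN, then each
# column's NaNs are filled with that column's threshold.
# Caveat: NaNs already present in df would be filled as well.
result = df[df < thresholds].fillna(thresholds)
```

Here result holds 0.5 and 1.0 in column A and 1.1 and -1.0 in column B, while df is untouched.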

JohnE · Accepted Answer

This is pretty fast (and returns a DataFrame):

np.minimum(df, [1.0, 1.1, 1.2, 1.3])

It's a pleasant surprise that NumPy is so amenable to this without any reshaping or explicit conversions...
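
A minimal sketch of why this works, assuming a recent pandas and NumPy: the four thresholds broadcast across the four columns, so each column is capped at its own value. The fixed seed and the names capped and clipped are illustrative. The sketch passes thresholds.values rather than the Series itself, since passing a Series alongside a DataFrame to a ufunc may not be supported on every pandas version. On versions newer than the 0.14.1 mentioned in the question, DataFrame.clip with upper= and axis=1 should give the same result:

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed, chosen only for reproducibility
df = pd.DataFrame(rng.randn(100, 4), columns=list('ABCD'))
thresholds = pd.Series({'A': 1, 'B': 1.1, 'C': 1.2, 'D': 1.3})

# The threshold array has shape (4,), so NumPy broadcasts it across the
# rows and each column is compared with its own cap.
capped = np.minimum(df, thresholds.values)

# Assumed alternative on newer pandas: clip each column at its threshold.
clipped = df.clip(upper=thresholds, axis=1)
```

Both capped and clipped are DataFrames with the same index and columns as df.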
