Reputation: 359
I would like to make a filter for an entire dataframe, which includes many more columns than the ones shown here. I'd like this filter to return values in each column once a minimum threshold value has been reached, and to stop once a maximum threshold value has been reached. I'd like the min threshold to be 6.5 and the max to be 9.0. It's not as simple as it sounds, so bear with me...
The dataframe:
Time A1 A2 A3
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
The desired result:
Time A1 A2 A3
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN
To drive home the point: in column A1, for example, at Time 3 there is a value of 6.407, which is lower than the 6.5 threshold, but since the threshold was met at Time 2, I would like to keep the data once the min threshold has been met. As for the upper threshold, in column A1 at Time 9 the value is above the 9.0 threshold, so I would like to omit that value and every value after it, even though the remaining values are below 9.0. I'm hoping to iterate this over many more columns.
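To pin down my intent, here is the rule for a single column written as a plain loop (illustrative only; filter_one_column is just a name I made up):
import math

def filter_one_column(values, lo=6.5, hi=9.0):
    # Walk the column in Time order, tracking whether the min threshold
    # has ever been met and whether the max threshold has been breached.
    out, started, stopped = [], False, False
    for v in values:
        if v > lo:
            started = True   # min threshold met: start keeping values
        if v > hi:
            stopped = True   # max threshold breached: drop everything from here on
        out.append(v if started and not stopped else math.nan)
    return out

# Column A1 from the table above:
filter_one_column([6.305, 6.507, 6.407, 6.963, 7.227, 7.445,
                   7.710, 8.904, 9.394, 8.803, 8.783])
# -> [nan, 6.507, 6.407, 6.963, 7.227, 7.445, 7.710, 8.904, nan, nan, nan]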
Thank you!!!
Upvotes: 3
Views: 2505
Reputation: 221614
Implementation
Here's a vectorized approach using NumPy boolean indexing:
import numpy as np
import pandas as pd

# Extract values into an array
arr = df.values

# Determine the min/max limit indices along each column
# (argmax(0) gives the row index of the first True in each column)
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)

# Set up the corresponding boolean mask and set those elements to NaN
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan

# Finally, convert back to a dataframe, keeping the original index
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
Please note that, alternatively, one can mask the input dataframe directly instead of re-creating it, but the interesting finding here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe anyway, re-creating it is cheap.
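For completeness, the direct-masking alternative might look like this (a sketch reusing the mask computed above; DataFrame.mask accepts a boolean array and returns a copy with NaNs where the condition holds):
# Mask the dataframe directly instead of round-tripping through NumPy
out = df.mask(mask)   # NaN wherever mask is True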
Closer look
Now, let's take a closer look at the mask creation part, which is the crux of this solution.
1) Input array:
In [148]: arr
Out[148]:
array([[ 6.305, 6.191, 5.918],
[ 6.507, 6.991, 6.203],
[ 6.407, 6.901, 6.908],
[ 6.963, 7.127, 7.116],
[ 7.227, 7.33 , 7.363],
[ 7.445, 7.632, 7.575],
[ 7.71 , 7.837, 7.663],
[ 8.904, 8.971, 8.895],
[ 9.394, 9.194, 8.994],
[ 8.803, 8.113, 9.333],
[ 8.783, 8.783, 8.783]])
2) Min/max limits (argmax along axis 0 gives the row index of the first True in each column):
In [149]: # Determine the min,max limits along each column
...: minl = (arr > 6.5).argmax(0)
...: maxl = (arr>9).argmax(0)
...:
In [150]: minl
Out[150]: array([1, 1, 2])
In [151]: maxl
Out[151]: array([8, 8, 9])
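As a quick aside on why argmax works here: on a boolean array, True is the maximum, so argmax returns the index of the first True. One caveat (my note, not exercised by this example): if a column never crosses a threshold there is no True, and argmax still returns 0, so such columns would need special handling.
import numpy as np

np.array([False, True, False, True]).argmax()  # 1: index of the first True
np.array([False, False, False]).argmax()       # 0: no True, yet argmax returns 0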
3) Using broadcasting to create a mask that spans the entire array and selects the elements to be set as NaNs. R has shape (11, 1) while minl and maxl have shape (3,), so the comparisons broadcast to the full (11, 3) shape:
In [152]: R = np.arange(arr.shape[0])[:,None]
In [153]: R
Out[153]:
array([[ 0],
[ 1],
[ 2],
[ 3],
[ 4],
[ 5],
[ 6],
[ 7],
[ 8],
[ 9],
[10]])
In [154]: (R < minl) | (R >= maxl)
Out[154]:
array([[ True, True, True],
[False, False, True],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[False, False, False],
[ True, True, False],
[ True, True, True],
[ True, True, True]], dtype=bool)
Runtime test
Let's time the approaches listed thus far. Since the question mentions that there would be many columns, let's use a decently large number of them.
Approaches listed as functions:
def cumsum_app(df):  # listed in the other answer by @Merlin
    df2 = df > 6.5
    df = df[df2.cumsum() > 0]
    df2 = df > 9
    df = df[~(df2.cumsum() > 0)]

def boolean_indexing_app(df):  # approach listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr, columns=df.columns)
Timings:
In [163]: # Create a random array with floating pt numbers between 6 and 10
...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
...:
...: # Create copies for testing approaches
...: df1 = df.copy()
...: df2 = df.copy()
In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop
In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop
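So, on this 11-row, 10,000-column frame, the boolean-indexing approach comes out roughly 8x faster than the cumsum-based one.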
Upvotes: 2
Reputation: 25659
Try this:
df
A1 A2 A3
Time
1 6.305 6.191 5.918
2 6.507 6.991 6.203
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 9.394 9.194 8.994
10 8.803 8.113 9.333
11 8.783 8.783 8.783
# True from the first row where each column exceeds 6.5 onward
df2 = df > 6.5
df = df[df2.cumsum() > 0]
# True from the first row where each column exceeds 9 onward; drop those values
df2 = df > 9
df = df[~(df2.cumsum() > 0)]
df
A1 A2 A3
Time
1 NaN NaN NaN
2 6.507 6.991 NaN
3 6.407 6.901 6.908
4 6.963 7.127 7.116
5 7.227 7.330 7.363
6 7.445 7.632 7.575
7 7.710 7.837 7.663
8 8.904 8.971 8.895
9 NaN NaN 8.994
10 NaN NaN NaN
11 NaN NaN NaN
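The same idea can be wrapped up as a reusable function (a sketch; threshold_window and its parameter names are mine):
def threshold_window(df, lo=6.5, hi=9.0):
    # Keep values from the first one above lo, drop from the first one above hi
    started = (df > lo).cumsum() > 0
    stopped = (df > hi).cumsum() > 0
    return df[started & ~stopped]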
Upvotes: 2