RageQuilt

Reputation: 359

Removing values in dataframe once threshold (min/max) value has been reached with Pandas

I would like to make a filter for the entire dataframe, which includes many more columns beyond the A1-A3 shown here. The filter should start returning values in each column once a minimum threshold value has been reached, and stop once a maximum threshold value has been reached. I'd like the min threshold to be 6.5 and the max to be 9.0. It's not as simple as it sounds, so hang with me...

The dataframe:

Time    A1      A2      A3
1       6.305   6.191   5.918
2       6.507   6.991   6.203
3       6.407   6.901   6.908
4       6.963   7.127   7.116
5       7.227   7.330   7.363
6       7.445   7.632   7.575
7       7.710   7.837   7.663
8       8.904   8.971   8.895
9       9.394   9.194   8.994
10      8.803   8.113   9.333
11      8.783   8.783   8.783
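
For reproducibility, the frame can be built like this (treating Time as the index, as the answers below do):

import pandas as pd

# Reconstruction of the example frame above, with Time as the index
df = pd.DataFrame(
    {"A1": [6.305, 6.507, 6.407, 6.963, 7.227, 7.445, 7.710, 8.904, 9.394, 8.803, 8.783],
     "A2": [6.191, 6.991, 6.901, 7.127, 7.330, 7.632, 7.837, 8.971, 9.194, 8.113, 8.783],
     "A3": [5.918, 6.203, 6.908, 7.116, 7.363, 7.575, 7.663, 8.895, 8.994, 9.333, 8.783]},
    index=pd.RangeIndex(1, 12, name="Time"),
)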

The desired result:

Time    A1      A2      A3
1       NaN     NaN     NaN
2       6.507   6.991   NaN
3       6.407   6.901   6.908
4       6.963   7.127   7.116
5       7.227   7.330   7.363
6       7.445   7.632   7.575
7       7.710   7.837   7.663
8       8.904   8.971   8.895
9       NaN     NaN     8.994
10      NaN     NaN     NaN
11      NaN     NaN     NaN

To drive home the point: in column A1, for example, the value at Time 3 is 6.407, which is below the 6.5 threshold, but since the threshold was already met at Time 2, I'd like to keep that value; once the min threshold has been met, the data should be kept. As for the upper threshold: in column A1 at Time 9, the value is above the 9.0 threshold, so I'd like that value and every value after it omitted, even though the later values drop back below 9.0. I'm hoping to apply this across many, many more columns; a naive sketch of the rule follows below.
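
For clarity, here is a naive per-column sketch of the rule (window_mask is just an illustrative name, not existing code):

import pandas as pd

# Illustrative sketch of the rule above (hypothetical helper):
# keep values from the first value above lo up to, but not including,
# the first value above hi; everything outside that window becomes NaN.
def window_mask(col, lo=6.5, hi=9.0):
    started = (col > lo).cumsum() > 0   # True from the first crossing of lo onward
    stopped = (col > hi).cumsum() > 0   # True from the first crossing of hi onward
    return col.where(started & ~stopped)

result = df.apply(window_mask)          # apply column-wise across the frame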

Thank you!!!

Upvotes: 3

Views: 2505

Answers (2)

Divakar

Reputation: 221614

Implementation

Here's a vectorized approach using NumPy boolean indexing -

import numpy as np
import pandas as pd

# Extract values into a NumPy array
arr = df.values

# Determine the min,max cutoffs along each column: the index of the first
# value above the min threshold and of the first value above the max threshold
minl = (arr > 6.5).argmax(0)
maxl = (arr > 9).argmax(0)

# Set up the corresponding boolean mask via broadcasting and set those
# elements in the array to NaN
R = np.arange(arr.shape[0])[:, None]
mask = (R < minl) | (R >= maxl)
arr[mask] = np.nan

# Finally, convert back to a dataframe, keeping the original index
df = pd.DataFrame(arr, columns=df.columns, index=df.index)

Note that, alternatively, one can mask directly into the input dataframe instead of re-creating it; the interesting find here is that boolean indexing into a NumPy array is faster than into a pandas dataframe. Since we are filtering the entire dataframe anyway, re-creating it is cheap.
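
For completeness, a minimal sketch of that in-place alternative, assuming mask is the boolean array built above (pandas accepts a same-shaped boolean array for elementwise assignment):

# In-place alternative (sketch): assign NaN through the same boolean mask
# instead of rebuilding the frame from the NumPy array.
df[mask] = np.nan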

Closer look

Now, let's take a closer look at the mask creation part, which is the crux of this solution.

1) Input array :

In [148]: arr
Out[148]: 
array([[ 6.305,  6.191,  5.918],
       [ 6.507,  6.991,  6.203],
       [ 6.407,  6.901,  6.908],
       [ 6.963,  7.127,  7.116],
       [ 7.227,  7.33 ,  7.363],
       [ 7.445,  7.632,  7.575],
       [ 7.71 ,  7.837,  7.663],
       [ 8.904,  8.971,  8.895],
       [ 9.394,  9.194,  8.994],
       [ 8.803,  8.113,  9.333],
       [ 8.783,  8.783,  8.783]])

2) Min,max limits :

In [149]: # Determine the min,max limits along each column
     ...: minl = (arr > 6.5).argmax(0)
     ...: maxl = (arr>9).argmax(0)
     ...: 

In [150]: minl
Out[150]: array([1, 1, 2])

In [151]: maxl
Out[151]: array([8, 8, 9])
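
One caveat worth adding here (not in the original answer): argmax returns 0 for a column that never crosses its threshold, so a column that never exceeds 9 would end up entirely masked. A guard for that case could look like this:

# Guard (sketch): treat "never exceeded the upper threshold" as "no upper
# cutoff" by pointing maxl past the last row instead of at row 0.
exceeded = (arr > 9).any(0)
maxl = np.where(exceeded, (arr > 9).argmax(0), arr.shape[0])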

3) Using broadcasting to create a mask that spans the entire dataframe/array and selects the elements that are to be set to NaN :

In [152]: R = np.arange(arr.shape[0])[:,None]

In [153]: R
Out[153]: 
array([[ 0],
       [ 1],
       [ 2],
       [ 3],
       [ 4],
       [ 5],
       [ 6],
       [ 7],
       [ 8],
       [ 9],
       [10]])

In [154]: (R < minl) | (R >= maxl)
Out[154]: 
array([[ True,  True,  True],
       [False, False,  True],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [False, False, False],
       [ True,  True, False],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)

Runtime test

Let's time the approaches listed thus far. Since it was mentioned that there would be many columns, let's use a decently big number of them.

Approaches listed as functions :

def cumsum_app(df):    # Listed in the other solution by @Merlin
    df2 = df > 6.5
    df  = df[df2.cumsum() > 0]
    df2 = df > 9
    df  = df[~(df2.cumsum() > 0)]
    return df

def boolean_indexing_app(df):  # Approach listed in this post
    arr = df.values
    minl = (arr > 6.5).argmax(0)
    maxl = (arr > 9).argmax(0)
    R = np.arange(arr.shape[0])[:, None]
    mask = (R < minl) | (R >= maxl)
    arr[mask] = np.nan
    df = pd.DataFrame(arr, columns=df.columns, index=df.index)
    return df
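
Before timing, a quick equivalence check (my addition; it relies on the return df lines above and on df being the example frame from the question):

# Sanity check (sketch): both approaches should agree cell-for-cell,
# NaNs included. Copies are passed because boolean_indexing_app writes
# into its input's underlying array.
pd.testing.assert_frame_equal(cumsum_app(df.copy()),
                              boolean_indexing_app(df.copy()))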

Timings :

In [163]: # Create a random array with floating pt numbers between 6 and 10
     ...: df = pd.DataFrame((np.random.rand(11,10000)*4)+6)
     ...: 
     ...: # Create copies for testing approaches
     ...: df1 = df.copy()
     ...: df2 = df.copy()


In [164]: %timeit cumsum_app(df1)
100 loops, best of 3: 16.4 ms per loop

In [165]: %timeit boolean_indexing_app(df2)
100 loops, best of 3: 2.09 ms per loop

Upvotes: 2

Merlin

Reputation: 25659

Try this:

df 
        A1     A2     A3
Time                     
1     6.305  6.191  5.918
2     6.507  6.991  6.203
3     6.407  6.901  6.908
4     6.963  7.127  7.116
5     7.227  7.330  7.363
6     7.445  7.632  7.575
7     7.710  7.837  7.663
8     8.904  8.971  8.895
9     9.394  9.194  8.994
10    8.803  8.113  9.333
11    8.783  8.783  8.783

df2 = df > 6.5
df  = df[df2.cumsum() > 0]     # mask everything before the first value above 6.5
df2 = df > 9
df  = df[~(df2.cumsum() > 0)]  # mask everything from the first value above 9 onward

df 
         A1     A2     A3
Time                     
1       NaN    NaN    NaN
2     6.507  6.991    NaN
3     6.407  6.901  6.908
4     6.963  7.127  7.116
5     7.227  7.330  7.363
6     7.445  7.632  7.575
7     7.710  7.837  7.663
8     8.904  8.971  8.895
9       NaN    NaN  8.994
10      NaN    NaN    NaN
11      NaN    NaN    NaN
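
The same logic can also be written without reassigning df in steps; a compact equivalent (sketch):

# Compact equivalent (sketch): chain the two cumulative masks with where();
# values outside the [first > 6.5, first > 9) window become NaN.
result = df.where((df > 6.5).cumsum() > 0).where(~((df > 9).cumsum() > 0))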

Upvotes: 2
