Celius Stingher

Reputation: 18367

Filter rows that are less than the cumulative maximum

I'm trying to filter rows based on a relatively simple criterion: if the value in Open is less than the maximum of that column over all previous rows, the row gets dropped; otherwise the row stays and becomes the reference value for the new maximum. This is the starting example dataframe:

import pandas as pd
import numpy as np
d = {'Date':['22-01-2019','23-01-2019','24-01-2019','25-01-2019','26-01-2019'],'Open': [40,54,54,79,67], 'Close': [43,53,65,80,61]}
df = pd.DataFrame(data=d)
print(df)

In this case I would like to do the filtering on the column Open:

         Date  Open  Close
0  22-01-2019    40     43 #Max is 40
1  23-01-2019    54     53 #54 is higher than 40 so it stays
2  24-01-2019    54     65 #This is not higher than the previous max, should get dropped
3  25-01-2019    79     80 #This is higher than 54, so it stays
4  26-01-2019    67     61 #This is not higher than 79, should get dropped

The only way I could come up with to solve the problem is a for loop that iterates over each row, keeps an auxiliary variable holding the running maximum, and builds a boolean series. However, it is extremely inefficient when dealing with more than 100k rows. The final goal is to perform the same filter on the Close column and join the results, to know on which days (the original data is every 15 minutes) both the Open and Close values have risen above the highest value previously recorded.

Finally the output should look like this:

         Date  Open  Close
0  22-01-2019    40     43
1  23-01-2019    54     53
3  25-01-2019    79     80

If the same operation is done on the Close column, the result should look like:

         Date  Open  Close
0  22-01-2019    40     43
1  23-01-2019    54     53
2  24-01-2019    54     65
3  25-01-2019    79     80

The final goal (which I would know how to do once I can get through the filtering part, but I'm sharing it for the sake of the full case) is:

         Date  Open  Close
0  22-01-2019    40     43
1  23-01-2019    54     53
3  25-01-2019    79     80

My solution is:

max_v = 0                    # running maximum seen so far
list_for_filtering = []      # boolean mask, one entry per row
for i, row in df.iterrows():
    if row['Open'] > max_v:  # new high: keep the row and update the maximum
        max_v = row['Open']
        list_for_filtering.append(True)
    else:                    # not a new high: mark the row for dropping
        list_for_filtering.append(False)
df['T/F'] = list_for_filtering

And then filter, keeping only the rows marked True.
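For completeness, that last step might look like this (a minimal sketch; the helper column name 'T/F' just follows the code above):

filtered = df[df['T/F']].drop(columns='T/F')  # keep only the rows flagged True
print(filtered)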

Upvotes: 0

Views: 60

Answers (1)

cs95

Reputation: 402523

One simple solution is to compare "Open" with the shifted cummax:

# thanks to Andy L. for the simplification!
df[df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)]

         Date  Open  Close
0  22-01-2019    40     43
1  23-01-2019    54     53
3  25-01-2019    79     80

Where,

df['Open'].cummax().shift()

0     NaN
1    40.0
2    54.0
3    54.0
4    79.0
Name: Open, dtype: float64
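
To get to the final goal described in the question (rows where both Open and Close exceed every previously seen value), the same shifted-cummax comparison can be applied to each column and the masks combined. This is only a sketch based on the question's stated goal, not part of the one-liner above:

# one mask per column: value strictly greater than the running max of all previous rows
open_mask = df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)
close_mask = df['Close'] > df['Close'].cummax().shift(fill_value=-np.inf)

# keep only the rows where both conditions hold
df[open_mask & close_mask]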

Upvotes: 3
