Reputation: 18367
I'm trying to filter rows based on a relative simple criteria. If the value for Open
is less than the max
value for the column until that row, it gets dropped, otherwise the row stays and is the reference value for the new max
.
This is the starting example dataframe:
import pandas as pd
import numpy as np
d = {'Date':['22-01-2019','23-01-2019','24-01-2019','25-01-2019','26-01-2019'],'Open': [40,54,54,79,67], 'Close': [43,53,65,65,61]}
df = pd.DataFrame(data=d)
print(df)
In this case I would like to do the filtering on the column Open
:
Date Open Close
0 22-01-2019 40 43 #Max is 40
1 23-01-2019 54 53 #54 is higher than 40 so it stays
2 24-01-2019 54 65 #This is not higher than the previous max, should get dropped
3 25-01-2019 79 80 #This is higher than 54, so it stays
4 26-01-2019 67 61 #This is not higher than 79, should get dropped
The only way I could come up to solve the problem with a for
loop iterating over each row in particular, defining an auxiliary variable that records does comparison, and returns a boolean series. However it's extremely inefficient when dealing with more than 100k rows. The final goal is to perform the same filter on the Close
column and join them to know in which days (the original data is every 15 minutes) both Open and Close values have risen above the highest value ever (previously) recorded.
Finally the output should look like this:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
If doing the same operation for the Close
column it should look like:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
2 24-01-2019 54 65
3 25-01-2019 79 80
The final goal (which I would know how to do once the I can get through the filtering part, but just sharing for the sake of the full case) is:
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 80
My solution is:
max_v = 0
list_for_filtering = []
for i, value in df.iterrows():
if value['Open'] > max_v:
max_v = value['Open']
list_for_filtering.append(True)
else:
pass
list_for_filtering.append(False)
df['T/F'] = list_for_filtering
And filter keeping only the True
values
Upvotes: 0
Views: 60
Reputation: 402523
One simple solution is to compare "Open" with the shifted cummax:
# thanks to Andy L. for the simplification!
df[df['Open'] > df['Open'].cummax().shift(fill_value=-np.inf)]
Date Open Close
0 22-01-2019 40 43
1 23-01-2019 54 53
3 25-01-2019 79 65
Where,
df['Open'].cummax().shift()
0 NaN
1 40.0
2 54.0
3 54.0
4 79.0
Name: Open, dtype: float64
Upvotes: 3