Reputation: 383
I have a DataFrame of dates and values (in the code below, I may not have parsed the dates correctly).
import pandas as pd

d = {'date': pd.Series(['2010-01-01', '2011-01-01',
                        '2012-01-01', '2012-07-01',
                        '2013-01-01']),
     'value': pd.Series([0, 2, 1, 4, 3])}
df = pd.DataFrame(d)
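If the dates need to be real datetimes rather than strings, I assume something like this would work:

```python
import pandas as pd

d = {'date': pd.Series(['2010-01-01', '2011-01-01',
                        '2012-01-01', '2012-07-01',
                        '2013-01-01']),
     'value': pd.Series([0, 2, 1, 4, 3])}
df = pd.DataFrame(d)
# Parse the strings into actual datetime64 values
df['date'] = pd.to_datetime(df['date'])
```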
I'd like a function that can filter this DataFrame to only give me the rows that are "the biggest value I've seen so far" (with respect to dates). In this case, I would end up with 3 rows (the current rows 0, 1, and 3 with the values 0, 2, and 4).
Upvotes: 2
Views: 264
Reputation: 33793
Use cummax
on the 'value' column to get the cumulative max, then compare the cumulative max to the 'value' column itself, and only keep rows where the 'value' column equals its cumulative max:
df[df['value'].cummax() == df['value']]
Note that the method described above will include duplicate maximums. For example, if there were an additional row with a value of 4, both rows with 4 would be included in the output.
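For instance, a quick sketch with a hypothetical extra row of value 4 appended to your sample data:

```python
import pandas as pd

df2 = pd.DataFrame({
    'date': ['2010-01-01', '2011-01-01', '2012-01-01',
             '2012-07-01', '2013-01-01', '2014-01-01'],
    'value': [0, 2, 1, 4, 3, 4],
})
# Both rows with value 4 pass the filter, since each equals the running max
out = df2[df2['value'].cummax() == df2['value']]
print(out)  # rows 0, 1, 3, and 5
```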
If you don't want duplicates, you can take a similar approach with cummax
, but only keep rows where the cumulative max changes. To get this, use diff
on the cumulative max to get the difference from the previous value, and keep rows where the difference is positive. Add fillna
with a positive value to keep the first row (whose difference is NaN):
df[df['value'].cummax().diff().fillna(1) > 0]
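To see why this works, here are the intermediate series for the sample data (a sketch assuming the df from the question):

```python
import pandas as pd

df = pd.DataFrame({'value': [0, 2, 1, 4, 3]})
cm = df['value'].cummax()      # 0, 2, 2, 4, 4
step = cm.diff()               # NaN, 2.0, 0.0, 2.0, 0.0
mask = step.fillna(1) > 0      # True, True, False, True, False
print(df[mask])                # rows 0, 1, and 3
```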
A slightly simpler way to remove duplicates is to apply the first method and then drop_duplicates
, though depending on your data this may not be as performant:
df[df['value'].cummax() == df['value']].drop_duplicates(subset='value')
The resulting output for your sample data, using any of the methods above:
date value
0 2010-01-01 0
1 2011-01-01 2
3 2012-07-01 4
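As a quick sanity check on the question's sample data (which has no duplicate maximums), all three expressions produce the same frame:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2010-01-01', '2011-01-01', '2012-01-01',
             '2012-07-01', '2013-01-01'],
    'value': [0, 2, 1, 4, 3],
})
m1 = df[df['value'].cummax() == df['value']]
m2 = df[df['value'].cummax().diff().fillna(1) > 0]
m3 = df[df['value'].cummax() == df['value']].drop_duplicates(subset='value')
assert m1.equals(m2) and m2.equals(m3)
```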
Upvotes: 2