Reputation: 75
All,
I have a dataframe in the following format:
ind date value1 value2
x1 23-04-2018 1.3 7.2
x1 03-05-2018 4.6 3.5
x2 04-04-2018 2.0 8.5
ind and date are the indices. If there are multiple rows for one value of ind I only want to keep the latest date.
So, in my example the preferred result is:
ind date value1 value2
x1 03-05-2018 4.6 3.5
x2 04-04-2018 2.0 8.5
The date column is in datetime format, so finding the max date within groups could work. But is it possible to use such a condition, for example in combination with a groupby (like GROUPBY and HAVING in SQL). Or is there a better method?
Does anyone have a solution or tips?
Edit: I slightly changed my example. Multiple rows are not necessarily duplicates.
Upvotes: 1
Views: 433
Reputation: 13820
A somewhat verbose method that requires no sorting:
# Gets the row indices as list of lists
idx = df.reset_index('date', drop=False) \
.groupby('ind', sort=False)['date'] \
.max() \
.iteritems()
df.loc(axis=0)[list(idx)]
This relies on the order of ['idx', 'date']
in df
.
Upvotes: 0
Reputation: 153500
You can use sort_index and drop duplicates.
Where df has no indexes at first:
print(df)
ind date value1 value2
0 x1 2018-04-23 1.3 7.2
1 x1 2018-03-05 1.3 7.2
2 x2 2018-04-04 2.0 8.5
df.set_index(['ind','date']).sort_index(level=[1], ascending=[False]).drop_duplicates()
Or as @piRSquare states:
df.set_index(['ind','date']).sort_index(level=[1]).drop_duplicates(keep='last')
Output:
value1 value2
ind date
x1 2018-04-23 1.3 7.2
x2 2018-04-04 2.0 8.5
Upvotes: 3