Select subset of dataframe using condition on multi index

Question

All,

I have a dataframe in the following format:

ind date value1 value2 x1 23-04-2018 1.3 7.2 x1 03-05-2018 4.6 3.5 x2 04-04-2018 2.0 8.5

ind and date are the indices. If there are multiple rows for one value of ind I only want to keep the latest date.

So, in my example the preferred result is:

ind date value1 value2 x1 03-05-2018 4.6 3.5 x2 04-04-2018 2.0 8.5

The date column is in datetime format, so finding the max date within groups could work. But is it possible to use such a condition, for example in combination with a groupby (like GROUPBY and HAVING in SQL). Or is there a better method?

Does anyone have a solution or tips?

Edit: I slightly changed my example. Multiple rows are not necessarily duplicates.

Scott Boston · Accepted Answer

You can use sort_index and drop duplicates.

Where df has no indexes at first:

print(df)

  ind       date  value1  value2
0  x1 2018-04-23     1.3     7.2
1  x1 2018-03-05     1.3     7.2
2  x2 2018-04-04     2.0     8.5

df.set_index(['ind','date']).sort_index(level=[1], ascending=[False]).drop_duplicates()

Or as @piRSquare states:

df.set_index(['ind','date']).sort_index(level=[1]).drop_duplicates(keep='last')

Output:

                value1  value2
ind date                      
x1  2018-04-23     1.3     7.2
x2  2018-04-04     2.0     8.5

Select subset of dataframe using condition on multi index

Answers (2)

Related Questions