Rob
Rob

Reputation: 75

Select subset of dataframe using condition on multi index

All,

I have a dataframe in the following format:

ind date value1 value2 x1 23-04-2018 1.3 7.2 x1 03-05-2018 4.6 3.5 x2 04-04-2018 2.0 8.5

ind and date are the indices. If there are multiple rows for one value of ind I only want to keep the latest date.

So, in my example the preferred result is:

ind date value1 value2 x1 03-05-2018 4.6 3.5 x2 04-04-2018 2.0 8.5

The date column is in datetime format, so finding the max date within groups could work. But is it possible to use such a condition, for example in combination with a groupby (like GROUPBY and HAVING in SQL). Or is there a better method?

Does anyone have a solution or tips?

Edit: I slightly changed my example. Multiple rows are not necessarily duplicates.

Upvotes: 1

Views: 433

Answers (2)

BallpointBen
BallpointBen

Reputation: 13820

A somewhat verbose method that requires no sorting:

# Gets the row indices as list of lists
idx = df.reset_index('date', drop=False) \
    .groupby('ind', sort=False)['date'] \
    .max() \
    .iteritems()

df.loc(axis=0)[list(idx)]

This relies on the order of ['idx', 'date'] in df.

Upvotes: 0

Scott Boston
Scott Boston

Reputation: 153500

You can use sort_index and drop duplicates.

Where df has no indexes at first:

print(df)

  ind       date  value1  value2
0  x1 2018-04-23     1.3     7.2
1  x1 2018-03-05     1.3     7.2
2  x2 2018-04-04     2.0     8.5

df.set_index(['ind','date']).sort_index(level=[1], ascending=[False]).drop_duplicates()

Or as @piRSquare states:

df.set_index(['ind','date']).sort_index(level=[1]).drop_duplicates(keep='last')

Output:

                value1  value2
ind date                      
x1  2018-04-23     1.3     7.2
x2  2018-04-04     2.0     8.5

Upvotes: 3

Related Questions