Alex Poca

Reputation: 2566

pandas groupby: TOP 3 values in each group and store in DataFrame

This is a follow-up to pandas groupby: TOP 3 values for each group. The solution described there works perfectly when every group has at least 3 rows, but it fails if at least one group is smaller.

I propose here a new data set that requires another solution.

Some data are saved at random times and I need to find the highest 3 values for each hour:

                     VAL
TIME                    
2017-12-08 00:55:00   29
2017-12-08 01:10:00   56
2017-12-08 01:25:00   82
2017-12-08 01:40:00   13
2017-12-08 01:55:00   35
2017-12-08 02:10:00   53
2017-12-08 02:25:00   25
2017-12-08 02:40:00   23
2017-12-08 02:55:00   21
2017-12-08 03:10:00   12
2017-12-08 03:25:00   15

It should return this DataFrame, without the times at which the maxima were detected:

                     VAL1  VAL2  VAL3
TIME 
2017-12-08 00:00:00   29   None  None
2017-12-08 01:00:00   82    56    35
2017-12-08 02:00:00   53    25    23
2017-12-08 03:00:00   15    12   None

None appears in groups where fewer than 3 rows are available.

The code to generate the data set is:

from datetime import datetime, timedelta
import pandas as pd
import numpy as np

# sample timestamps at 15-minute intervals, starting at 00:55
date_ref = datetime(2017, 12, 8, 0, 55, 0)
days = pd.date_range(date_ref, date_ref + timedelta(0.11), freq='15min')

np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))

df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')

# group by hour, then take the 3 largest values in each group
group1 = df.groupby(pd.Grouper(freq='1H'))
largest3 = pd.DataFrame(group1["VAL"].nlargest(3))

My question is how to save these values into a new DataFrame, perhaps starting from largest3, which looks like this:

                                         VAL
TIME                TIME                    
2017-12-08 00:00:00 2017-12-08 00:55:00   29
2017-12-08 01:00:00 2017-12-08 01:25:00   82
                    2017-12-08 01:10:00   56
                    2017-12-08 01:55:00   35
2017-12-08 02:00:00 2017-12-08 02:10:00   53
                    2017-12-08 02:25:00   25
                    2017-12-08 02:40:00   23
2017-12-08 03:00:00 2017-12-08 03:25:00   15
                    2017-12-08 03:10:00   12

EDIT: Added reset_index

largest3 = pd.DataFrame(group1["VAL"].nlargest(3)).reset_index(level=1, drop=True)

gives a cleaner view, but I don't know how to proceed from here:

                     VAL
TIME                    
2017-12-08 00:00:00   29
2017-12-08 01:00:00   82
2017-12-08 01:00:00   56
2017-12-08 01:00:00   35
2017-12-08 02:00:00   53
2017-12-08 02:00:00   25
2017-12-08 02:00:00   23
2017-12-08 03:00:00   15
2017-12-08 03:00:00   12

Upvotes: 2

Views: 3486

Answers (1)

Alex Poca

Reputation: 2566

The trick is to create an index that is not based on set_index + modulus (as in the previous question): cumcount provides a progressive counter within each group:

largest3 = (pd.DataFrame(group1["VAL"].nlargest(3))
            .reset_index(level=1, drop=True))

largest3['index'] = largest3.groupby('TIME').cumcount()  # temporary index

largest3 = (largest3.set_index("index", append=True)['VAL']
    .unstack()
    .add_prefix('VAL'))

The result is, as requested:

index                VAL0  VAL1  VAL2
TIME                                 
2017-12-08 00:00:00  29.0   NaN   NaN
2017-12-08 01:00:00  82.0  56.0  35.0
2017-12-08 02:00:00  53.0  25.0  23.0
2017-12-08 03:00:00  15.0  12.0   NaN
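If the VAL1..VAL3 column names from the question are preferred, the same cumcount trick works with the counter shifted by one. A minimal self-contained sketch (the data from the question is hard-coded here instead of being regenerated from the seed, and the lowercase 'h' frequency alias is used, which newer pandas versions expect):

```python
import pandas as pd

# Sample data from the question
times = pd.to_datetime([
    '2017-12-08 00:55', '2017-12-08 01:10', '2017-12-08 01:25',
    '2017-12-08 01:40', '2017-12-08 01:55', '2017-12-08 02:10',
    '2017-12-08 02:25', '2017-12-08 02:40', '2017-12-08 02:55',
    '2017-12-08 03:10', '2017-12-08 03:25'])
df = pd.DataFrame({'VAL': [29, 56, 82, 13, 35, 53, 25, 23, 21, 12, 15]},
                  index=pd.Index(times, name='TIME'))

# top 3 per hour, dropping the inner (original timestamp) index level
largest3 = (df.groupby(pd.Grouper(freq='h'))['VAL']
              .nlargest(3)
              .reset_index(level=1, drop=True)
              .to_frame())

# cumcount starts at 0; shift by one so columns come out as VAL1..VAL3
largest3['rank'] = largest3.groupby('TIME').cumcount() + 1
result = (largest3.set_index('rank', append=True)['VAL']
                  .unstack()
                  .add_prefix('VAL'))
print(result)
```

Groups with fewer than three rows get NaN in the missing columns, as in the expected output above.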

Upvotes: 2
