Reputation: 2566
This is a follow-up to pandas groupby: TOP 3 values for each group. The solution described there is perfect if every group has at least 3 rows, but it fails if at least one of the groups is smaller than that.
I propose here a new data set that requires another solution.
Some data are saved at random times and I need to find the highest 3 values for each hour:
VAL
TIME
2017-12-08 00:55:00 29
2017-12-08 01:10:00 56
2017-12-08 01:25:00 82
2017-12-08 01:40:00 13
2017-12-08 01:55:00 35
2017-12-08 02:10:00 53
2017-12-08 02:25:00 25
2017-12-08 02:40:00 23
2017-12-08 02:55:00 21
2017-12-08 03:10:00 12
2017-12-08 03:25:00 15
It should return this DataFrame, without the timestamps at which the maxima were detected:
VAL1 VAL2 VAL3
TIME
2017-12-08 00:00:00 29 None None
2017-12-08 01:00:00 82 56 35
2017-12-08 02:00:00 53 25 23
2017-12-08 03:00:00 15 12 None
None values appear in groups where fewer than 3 rows are available.
The code to generate the data set is:
from datetime import datetime, timedelta
import pandas as pd
import numpy as np

# sample data: random values at 15-minute intervals
date_ref = datetime(2017, 12, 8, 0, 55, 0)
days = pd.date_range(date_ref, date_ref + timedelta(0.11), freq='15min')
np.random.seed(seed=1111)
data1 = np.random.randint(1, high=100, size=len(days))
df = pd.DataFrame({'TIME': days, 'VAL': data1})
df = df.set_index('TIME')

# group by hour and pick the 3 largest values in each group
group1 = df.groupby(pd.Grouper(freq='1H'))
largest3 = pd.DataFrame(group1["VAL"].nlargest(3))
My question is: how can these values be saved into a new DataFrame, perhaps starting from largest3, which currently looks like this:
                                         VAL
TIME                TIME
2017-12-08 00:00:00 2017-12-08 00:55:00   29
2017-12-08 01:00:00 2017-12-08 01:25:00   82
                    2017-12-08 01:10:00   56
                    2017-12-08 01:55:00   35
2017-12-08 02:00:00 2017-12-08 02:10:00   53
                    2017-12-08 02:25:00   25
                    2017-12-08 02:40:00   23
2017-12-08 03:00:00 2017-12-08 03:25:00   15
                    2017-12-08 03:10:00   12
EDIT: Added reset_index
largest3 = pd.DataFrame(group1["VAL"].nlargest(3)).reset_index(level=1, drop=True)
This gives a better overview, but I don't know how to move on from here:
VAL
TIME
2017-12-08 00:00:00 29
2017-12-08 01:00:00 82
2017-12-08 01:00:00 56
2017-12-08 01:00:00 35
2017-12-08 02:00:00 53
2017-12-08 02:00:00 25
2017-12-08 02:00:00 23
2017-12-08 03:00:00 15
2017-12-08 03:00:00 12
Upvotes: 2
Views: 3486
Reputation: 2566
The trick is to create a new index that is not based on set_index plus modulus arithmetic: cumcount provides a progressive counter inside each group:
largest3 = (pd.DataFrame(group1["VAL"]
.nlargest(3))
.reset_index(level=1, drop=True))
largest3['index'] = largest3.groupby('TIME').cumcount() # temporary index
largest3 = (largest3.set_index("index", append=True)['VAL']
.unstack()
.add_prefix('VAL'))
The result is, as requested:
index VAL0 VAL1 VAL2
TIME
2017-12-08 00:00:00 29.0 NaN NaN
2017-12-08 01:00:00 82.0 56.0 35.0
2017-12-08 02:00:00 53.0 25.0 23.0
2017-12-08 03:00:00 15.0 12.0 NaN
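Note that unstack fills the missing slots with NaN, which turns the VAL columns into floats. As a small variant (a sketch, not part of the original answer), starting the cumcount at 1 reproduces the VAL1..VAL3 column names from the question, and on a recent pandas the nullable Int64 dtype keeps the values as integers:
largest3 = (pd.DataFrame(group1["VAL"].nlargest(3))
            .reset_index(level=1, drop=True))
# start the counter at 1 so the columns come out as VAL1..VAL3
largest3['index'] = largest3.groupby('TIME').cumcount() + 1
largest3 = (largest3.set_index("index", append=True)['VAL']
            .unstack()
            .add_prefix('VAL')
            .astype('Int64'))  # missing slots show as <NA> instead of NaN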
Upvotes: 2