Reputation: 11
I have a dataframe with some data that I'm going to run simulations on. Each row is a datetime and a value. Because of the nature of the problem, I need to keep the original frequency of 1 hour when the value is above a certain threshold. When it's not, I could resample the data and run that part of the simulation on lower frequency data, in order to speed up the simulation.
My idea is to somehow group the dataframe by day (since I've noticed there are many whole days where the value stays below the threshold), check the max value over each group, and if the max is below the threshold then aggregate the data in that group into a single mean value.
Here's a minimal working example:
import pandas as pd
import numpy as np
threshold = 3
idx = pd.date_range("2018-01-01", periods=27, freq="H")
df = pd.Series(np.append(np.ones(26), 5), index=idx).to_frame("v")
print(df)
Output:
v
2018-01-01 00:00:00 1.0
2018-01-01 01:00:00 1.0
2018-01-01 02:00:00 1.0
2018-01-01 03:00:00 1.0
2018-01-01 04:00:00 1.0
2018-01-01 05:00:00 1.0
2018-01-01 06:00:00 1.0
2018-01-01 07:00:00 1.0
2018-01-01 08:00:00 1.0
2018-01-01 09:00:00 1.0
2018-01-01 10:00:00 1.0
2018-01-01 11:00:00 1.0
2018-01-01 12:00:00 1.0
2018-01-01 13:00:00 1.0
2018-01-01 14:00:00 1.0
2018-01-01 15:00:00 1.0
2018-01-01 16:00:00 1.0
2018-01-01 17:00:00 1.0
2018-01-01 18:00:00 1.0
2018-01-01 19:00:00 1.0
2018-01-01 20:00:00 1.0
2018-01-01 21:00:00 1.0
2018-01-01 22:00:00 1.0
2018-01-01 23:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
The desired output of the operation would be this dataframe:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
where the first value is the mean of the first day.
I think I'm getting close:
grouped = df.resample("1D")
for name, group in grouped:
    if group["v"].max() <= 3:
        group['v'].agg("mean")
but I'm unsure how to actually apply the aggregation to the desired groups, and get a dataframe back.
Any help is greatly appreciated.
Upvotes: 0
Views: 60
Reputation: 11
So I found a solution:
grouped = df.resample("1D")

def conditionalAggregation(x):
    # Day stays at or below the threshold: collapse it into a single mean
    # value, timestamped at midnight of that day.
    if x['v'].max() <= 3:
        idx = [x.index[0].replace(hour=0, minute=0, second=0, microsecond=0)]
        return pd.DataFrame(x['v'].mean(), index=idx, columns=['v'])
    else:
        # Day exceeds the threshold: keep it at the original hourly resolution.
        return x

conditionallyAggregated = grouped.apply(conditionalAggregation)
# The apply adds the daily bin as an outer index level; drop it so only
# the timestamps remain.
conditionallyAggregated = conditionallyAggregated.droplevel(level=0)
conditionallyAggregated
This gives the following df:
v
2018-01-01 00:00:00 1.0
2018-01-02 00:00:00 1.0
2018-01-02 01:00:00 1.0
2018-01-02 02:00:00 5.0
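As a side note, the same result can probably be obtained without a Python-level function by flagging the quiet days up front and concatenating the two parts. This is just a rough sketch, reusing the df and threshold defined in the question (variable names are only for illustration):

# Broadcast each day's max back onto its hourly rows, then split the frame
# into "quiet" days (everything <= threshold) and the rest.
daily_max = df["v"].groupby(df.index.floor("D")).transform("max")
quiet = daily_max <= threshold

quiet_part = df[quiet].resample("1D").mean().dropna()  # one mean row per quiet day
busy_part = df[~quiet]                                 # untouched hourly rows

result = pd.concat([quiet_part, busy_part]).sort_index()
print(result)

On the example data this should give the same four rows as the desired output, and it avoids building a small DataFrame for every group.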
Upvotes: 1