Reputation: 31
quick background:
I'm familiar with pandas' rolling, resample, asfreq, and fillna methods, and with passing a time offset (e.g., '1T', '5min', etc.) as the window input to the rolling method. Time-based rolling in pandas looks backward from each timestamp, so the code below reverses the dataframe to make the window forward-looking.
what I want to achieve:
Starting from the raw, unevenly spaced input below, I want forward-looking 5-minute rolling min/max statistics reported at an even 1-minute frequency.
input data:
ts var1
2022-07-12 07:00:00+00:00 NaN
2022-07-12 07:01:53+00:00 1.5
2022-07-12 07:03:17+00:00 2.2
2022-07-12 07:04:02+00:00 0.9
2022-07-12 07:04:59+00:00 3.6
2022-07-12 07:05:00+00:00 NaN
2022-07-12 07:06:22+00:00 3.3
2022-07-12 07:09:46+00:00 2.3
2022-07-12 07:10:00+00:00 NaN
2022-07-12 07:11:22+00:00 1.3
2022-07-12 07:13:44+00:00 4.3
2022-07-12 07:14:26+00:00 4.1
2022-07-12 07:15:00+00:00 NaN
desired output (forward-looking 5-minute rolling min/max at 1-minute frequency):
ts var1_max var1_min
2022-07-12 07:00:00+00:00 3.6 0.9
2022-07-12 07:01:00+00:00 3.6 0.9
2022-07-12 07:02:00+00:00 3.6 0.9
2022-07-12 07:03:00+00:00 3.6 0.9
2022-07-12 07:04:00+00:00 3.6 0.9
2022-07-12 07:05:00+00:00 3.3 2.3
2022-07-12 07:06:00+00:00 3.3 2.3
2022-07-12 07:07:00+00:00 2.3 1.3
2022-07-12 07:08:00+00:00 2.3 1.3
2022-07-12 07:09:00+00:00 4.3 1.3
2022-07-12 07:10:00+00:00 4.3 1.3
2022-07-12 07:11:00+00:00 4.3 1.3
2022-07-12 07:12:00+00:00 4.3 4.1
2022-07-12 07:13:00+00:00 4.3 4.1
2022-07-12 07:14:00+00:00 4.1 4.1
2022-07-12 07:15:00+00:00 NaN NaN
even better, with NaN for minutes that contain no raw observation:
ts var1_max var1_min
2022-07-12 07:00:00+00:00 3.6 0.9
2022-07-12 07:01:00+00:00 3.6 0.9
2022-07-12 07:02:00+00:00 NaN NaN
2022-07-12 07:03:00+00:00 3.6 0.9
2022-07-12 07:04:00+00:00 3.6 0.9
2022-07-12 07:05:00+00:00 3.3 2.3
2022-07-12 07:06:00+00:00 3.3 2.3
2022-07-12 07:07:00+00:00 NaN NaN
2022-07-12 07:08:00+00:00 NaN NaN
2022-07-12 07:09:00+00:00 4.3 1.3
2022-07-12 07:10:00+00:00 4.3 1.3
2022-07-12 07:11:00+00:00 4.3 1.3
2022-07-12 07:12:00+00:00 NaN NaN
2022-07-12 07:13:00+00:00 4.3 4.1
2022-07-12 07:14:00+00:00 4.1 4.1
2022-07-12 07:15:00+00:00 NaN NaN
key issues:
- the NaN markers in the input (and the minutes with no raw observation at all) should come through as NaN in the output, i.e. the rolling window calculation should be informed by the raw data
- I could resample (or asfreq) before the rolling window calculation, but that loses valuable information whenever two values fall within the same 1-minute time frame. For example, because the input data has values of 3.6 and 0.9 between 07:04 and 07:05, resampling first will not capture both the min (=0.9) and the max (=3.6) properly at the end.
example code:
import numpy as np
import pandas as pd

#########################################################################
# input data
data_test = [
[
"01:00:00",
"01:01:53",
"01:03:17",
"01:04:02",
"01:04:59",
"01:05:00",
"01:06:22",
"01:09:46",
"01:10:00",
"01:11:22",
"01:13:44",
"01:14:26",
"01:15:00"
],
[np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]
]
data_test = pd.DataFrame(data_test).T
data_test.columns = ['ts', 'var1']
data_test['var1'] = data_test['var1'].astype(float)
data_test['ts'] = pd.to_datetime(data_test.ts)
data_test = data_test.set_index('ts')
#########################################################################
# reversing for forward looking window operation
data_test = data_test[::-1]
#########################################################################
# adding rolling window statistics: minimum
mins = (
    data_test.rolling(
        window="5min",
        min_periods=1,
        closed='right'
    ).min().add_suffix("_min")
)
# adding rolling window statistics: maximum
maxs = (
    data_test.rolling(
        window="5min",
        min_periods=1,
        closed='right'
    ).max().add_suffix("_max")
)
#########################################################################
# resampling with even interval
mins = mins.resample(rule="1min").min()
maxs = maxs.resample(rule="1min").max()
data_test = pd.concat([mins, maxs], axis=1)
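To make the information-loss concern above concrete, here is a minimal sketch (using the two raw samples from the 07:04 minute of the example data) showing that collapsing to one value per minute before any window math discards an extreme, while resampling once per statistic keeps both:

```python
import pandas as pd

# Two raw samples fall inside the same minute (07:04), as in the example data.
idx = pd.to_datetime(["2022-07-12 07:04:02", "2022-07-12 07:04:59"])
raw = pd.Series([0.9, 3.6], index=idx)

# Collapsing to a single value per minute (e.g. the mean) before any
# rolling calculation loses both extremes: 07:04 becomes 2.25.
collapsed = raw.resample("1min").mean()

# Resampling once per statistic keeps them: min stays 0.9, max stays 3.6.
per_min_min = raw.resample("1min").min()
per_min_max = raw.resample("1min").max()
```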
Upvotes: 2
Views: 525
Reputation: 31
I think I found a solution that generates the output I want, but I'm unsure whether it is the most elegant way. In short, it (1) uses the resample and agg methods to resample the data once per statistic (min, max, sum, count) and (2) applies the rolling method to each resampled statistic to get the rolling window statistics. To calculate the rolling mean/average, I divide the rolling sum by the rolling count.
#########################################################################
# resampling with agg method to retain statistics within time increment
data_resample_min = data_test.resample(rule='1min').agg(['min'])
data_resample_max = data_test.resample(rule='1min').agg(['max'])
data_resample_sum = data_test.resample(rule='1min').agg(['sum'])
data_resample_count = data_test.resample(rule='1min').agg(['count'])
#########################################################################
# reversing dataframe for forward-looking rolling window operation
data_resample_min = data_resample_min[::-1]
data_resample_max = data_resample_max[::-1]
data_resample_sum = data_resample_sum[::-1]
data_resample_count = data_resample_count[::-1]
#########################################################################
# merging multiindex column header
data_resample_min.columns = data_resample_min.columns.map('_'.join)
data_resample_max.columns = data_resample_max.columns.map('_'.join)
data_resample_sum.columns = data_resample_sum.columns.map('_'.join)
data_resample_count.columns = data_resample_count.columns.map('_'.join)
#########################################################################
# adding rolling window statistics: minimum
mins = (
    data_resample_min.rolling(
        window='5min',
        min_periods=1,
        center=False,
        closed='right'
    ).min()
)
# adding rolling window statistics: maximum
maxs = (
    data_resample_max.rolling(
        window='5min',
        min_periods=1,
        center=False,
        closed='right'
    ).max()
)
# adding rolling window statistics: sum
sums = (
    data_resample_sum.rolling(
        window='5min',
        min_periods=1,
        center=False,
        closed='right'
    ).sum()
)
# adding rolling window statistics: count
counts = (
    data_resample_count.rolling(
        window='5min',
        min_periods=1,
        center=False,
        closed='right'
    ).sum()  # this has to be sum, not count, for a proper count calculation
)
# adding rolling window statistics: mean
means = sums.copy()
means.columns = means.columns.str.replace("_sum", "_mean")
# suppress the invalid-value (0/0) warning from the division below
np.seterr(invalid='ignore')
means.loc[:, :] = sums.values / counts.values
#########################################################################
# creating summary dataframe
data_test = pd.concat([mins, maxs, means], axis=1)
data_test = data_test[::-1]
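The sum/count division above is what makes the rolling mean weight every raw sample equally, rather than averaging the minutely averages. A minimal sketch of just that part, on toy timestamps of my own choosing:

```python
import pandas as pd

# Three raw samples: two in minute 07:00, one in minute 07:01.
idx = pd.to_datetime(
    ["2022-07-12 07:00:10", "2022-07-12 07:00:50", "2022-07-12 07:01:30"]
)
raw = pd.Series([1.0, 3.0, 5.0], index=idx)

# Resample sum and count separately, then roll and divide: every raw
# sample gets equal weight in the window mean.
sums = raw.resample("1min").sum()      # 07:00 -> 4.0, 07:01 -> 5.0
counts = raw.resample("1min").count()  # 07:00 -> 2,   07:01 -> 1

roll_mean = sums.rolling("2min").sum() / counts.rolling("2min").sum()

# At 07:01 this is (4.0 + 5.0) / (2 + 1) = 3.0, the true mean of all three
# samples; a mean of the two minutely means would give (2.0 + 5.0) / 2 = 3.5.
```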
Upvotes: 1
Reputation: 13242
Given the following, where I've already converted ts
to datetime and made it the index:
var1
ts
2022-07-12 07:00:00+00:00 NaN
2022-07-12 07:01:53+00:00 1.5
2022-07-12 07:03:17+00:00 2.2
2022-07-12 07:04:02+00:00 0.9
2022-07-12 07:04:59+00:00 3.6
2022-07-12 07:05:00+00:00 NaN
2022-07-12 07:06:22+00:00 3.3
2022-07-12 07:09:46+00:00 2.3
2022-07-12 07:10:00+00:00 NaN
2022-07-12 07:11:22+00:00 1.3
2022-07-12 07:13:44+00:00 4.3
2022-07-12 07:14:26+00:00 4.1
2022-07-12 07:15:00+00:00 NaN
Doing:
df['var1_min'] = df.resample('5min')['var1'].transform('min')
df['var1_max'] = df.resample('5min')['var1'].transform('max')
df = df.asfreq('1min', method='ffill')
Output:
var1 var1_min var1_max
2022-07-12 07:00:00+00:00 NaN 0.9 3.6
2022-07-12 07:01:00+00:00 NaN 0.9 3.6
2022-07-12 07:02:00+00:00 1.5 0.9 3.6
2022-07-12 07:03:00+00:00 1.5 0.9 3.6
2022-07-12 07:04:00+00:00 2.2 0.9 3.6
2022-07-12 07:05:00+00:00 NaN 2.3 3.3
2022-07-12 07:06:00+00:00 NaN 2.3 3.3
2022-07-12 07:07:00+00:00 3.3 2.3 3.3
2022-07-12 07:08:00+00:00 3.3 2.3 3.3
2022-07-12 07:09:00+00:00 3.3 2.3 3.3
2022-07-12 07:10:00+00:00 NaN 1.3 4.3
2022-07-12 07:11:00+00:00 NaN 1.3 4.3
2022-07-12 07:12:00+00:00 1.3 1.3 4.3
2022-07-12 07:13:00+00:00 1.3 1.3 4.3
2022-07-12 07:14:00+00:00 4.3 1.3 4.3
2022-07-12 07:15:00+00:00 NaN NaN NaN
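If you want both statistics from a single resample pass instead of two transform calls, the same idea can be sketched with agg and then upsampling the binned stats (a variation on the code above, with toy data and column names of my own choosing):

```python
import pandas as pd

# Toy input: two samples in the first 5-minute bin, one in the second.
idx = pd.to_datetime(
    ["2022-07-12 07:00:30", "2022-07-12 07:01:10", "2022-07-12 07:06:00"]
)
df = pd.DataFrame({"var1": [2.0, 4.0, 1.0]}, index=idx)

# One resample pass computes both statistics per 5-minute bin; asfreq
# then upsamples the binned stats to a 1-minute grid with a forward fill.
stats = df["var1"].resample("5min").agg(["min", "max"]).add_prefix("var1_")
out = stats.asfreq("1min", method="ffill")
```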
Upvotes: 2