geekshock

Reputation: 31

Pandas rolling window statistics calculation on input data with uneven timestamps

quick background:

I have a time series whose values arrive at uneven timestamps (see the input data below). For every minute on a regular 1-minute grid I need forward-looking rolling statistics, i.e. for each minute the min/max of all samples within the following 5-minute window.

what I want to achieve:

input data (uneven timestamps):

ts                          var1
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN

desired output (forward-looking 5-minute rolling min/max, evaluated on a 1-minute grid):

ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   3.6         0.9
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   2.3         1.3
2022-07-12 07:08:00+00:00   2.3         1.3
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   4.3         4.1
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN

what my example code below currently produces instead (note the NaN rows at minutes that contain no raw sample):

ts                          var1_max    var1_min
2022-07-12 07:00:00+00:00   3.6         0.9
2022-07-12 07:01:00+00:00   3.6         0.9
2022-07-12 07:02:00+00:00   NaN         NaN
2022-07-12 07:03:00+00:00   3.6         0.9
2022-07-12 07:04:00+00:00   3.6         0.9
2022-07-12 07:05:00+00:00   3.3         2.3
2022-07-12 07:06:00+00:00   3.3         2.3
2022-07-12 07:07:00+00:00   NaN         NaN
2022-07-12 07:08:00+00:00   NaN         NaN
2022-07-12 07:09:00+00:00   4.3         1.3
2022-07-12 07:10:00+00:00   4.3         1.3
2022-07-12 07:11:00+00:00   4.3         1.3
2022-07-12 07:12:00+00:00   NaN         NaN
2022-07-12 07:13:00+00:00   4.3         4.1
2022-07-12 07:14:00+00:00   4.1         4.1
2022-07-12 07:15:00+00:00   NaN         NaN

key issues:

- the raw samples arrive at uneven timestamps, while the output has to be on a regular 1-minute grid;
- the rolling window has to be forward-looking, which is why the example code below reverses the dataframe before calling rolling();
- resampling the rolled results to 1 minute afterwards leaves NaN at every minute that contains no raw sample (07:02, 07:07, 07:08, 07:12 in the tables above), which is not what I want.

example code:

import numpy as np
import pandas as pd

#########################################################################
# input data
data_test = [
        [
            "01:00:00",
            "01:01:53",
            "01:03:17",
            "01:04:02",
            "01:04:59",
            "01:05:00",
            "01:06:22",
            "01:09:46",
            "01:10:00",
            "01:11:22",
            "01:13:44",
            "01:14:26",
            "01:15:00"
        ],
        [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]
    ]

data_test = pd.DataFrame(data_test).T
data_test.columns = ['ts', 'var1']
data_test['var1'] = data_test['var1'].astype(float)
data_test['ts'] = pd.to_datetime(data_test.ts)
data_test = data_test.set_index('ts')
    
#########################################################################
# reversing for forward looking window operation
data_test = data_test[::-1]
    
#########################################################################
# adding rolling window statistics: minimum
mins = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).min().add_suffix("_min")
)

# adding rolling window statistics: maximum
maxs = (
    data_test.rolling(
        window="5min", 
        min_periods=1,
        closed='right'
    ).max().add_suffix("_max")
)
    
#########################################################################
# resampling with even interval
mins = mins.resample(rule="1min").min()
maxs = maxs.resample(rule="1min").max()

data_test = pd.concat([mins, maxs], axis=1)   
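As far as I can tell, the NaN rows come from the final resample step: resample() emits a row for every 1-minute bin, and a bin that contains none of the original (uneven) timestamps has nothing to aggregate. A tiny illustration with made-up timestamps:

import pandas as pd

s = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2022-07-12 07:00:30", "2022-07-12 07:02:10"]),
)
print(s.resample("1min").min())
# 2022-07-12 07:00:00    1.0
# 2022-07-12 07:01:00    NaN    <- bin containing no source timestamps
# 2022-07-12 07:02:00    2.0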
    

Upvotes: 2

Views: 525

Answers (2)

geekshock

Reputation: 31

I think I found a solution that generates the output I want, though I'm not sure it is the most elegant way. In short, it (1) uses resample with agg to bring the data onto a regular 1-minute grid separately for each statistic (min, max, sum, count), and (2) applies rolling to each resampled statistic to get the rolling window statistics. The rolling mean is then obtained by dividing the rolling sum by the rolling count.

import numpy as np
import pandas as pd

#########################################################################
# resampling with agg method to retain statistics within each time increment
# (data_test is the raw, unevenly timestamped input frame from the question)
data_resample_min = data_test.resample(rule='1min').agg(['min'])
data_resample_max = data_test.resample(rule='1min').agg(['max'])
data_resample_sum = data_test.resample(rule='1min').agg(['sum'])
data_resample_count = data_test.resample(rule='1min').agg(['count'])

#########################################################################
# reversing dataframe for forward-looking rolling window operation
data_resample_min = data_resample_min[::-1]
data_resample_max = data_resample_max[::-1]
data_resample_sum = data_resample_sum[::-1]
data_resample_count = data_resample_count[::-1]

#########################################################################
# merging multiindex column header
data_resample_min.columns = data_resample_min.columns.map('_'.join)
data_resample_max.columns = data_resample_max.columns.map('_'.join)
data_resample_sum.columns = data_resample_sum.columns.map('_'.join)
data_resample_count.columns = data_resample_count.columns.map('_'.join)
    
#########################################################################   
# adding rolling window statistics: minimum
mins = (
    data_resample_min.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).min()
)
# adding rolling window statistics: maximum
maxs = (
    data_resample_max.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).max()
)
# adding rolling window statistics: sum
sums = (
    data_resample_sum.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum()
)
# adding rolling window statistics: count
counts = (
    data_resample_count.rolling(
        window='5min', 
        min_periods=1,
        center=False,
        closed='right'
    ).sum() # this has to be sum for proper count calculation
)
# adding rolling window statistics: mean
means = sums.copy()
means.columns = means.columns.str.replace("_sum","_mean")
# suppress the numpy "invalid value" warning from 0/0 divisions (windows with no samples)
np.seterr(invalid='ignore')
means.loc[:, :] = sums.values/counts.values

#########################################################################
# creating summary dataframe
data_test = pd.concat([mins, maxs, means], axis=1)  
data_test = data_test[::-1]
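The same idea can probably be written a little more compactly by resampling once with all four aggregations and reusing a single rolling object. A rough sketch of that variant (same assumptions as above; I have not timed it against the version shown here):

import pandas as pd

# resample once, keeping min/max/sum/count for every 1-minute bin
resampled = data_test.resample("1min").agg(["min", "max", "sum", "count"])
resampled.columns = resampled.columns.map("_".join)   # ('var1', 'min') -> 'var1_min'

# reverse for the forward-looking rolling window, as before
resampled = resampled[::-1]
roll = resampled.rolling(window="5min", min_periods=1, closed="right")

summary = pd.concat(
    [
        roll["var1_min"].min(),
        roll["var1_max"].max(),
        # pandas division yields NaN where the rolling count is zero
        (roll["var1_sum"].sum() / roll["var1_count"].sum()).rename("var1_mean"),
    ],
    axis=1,
)
summary = summary[::-1]   # back to chronological order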

Upvotes: 1

BeRT2me

Reputation: 13242

Given the following, where I've already converted ts to datetime and made it the index:

                           var1
ts
2022-07-12 07:00:00+00:00   NaN
2022-07-12 07:01:53+00:00   1.5
2022-07-12 07:03:17+00:00   2.2
2022-07-12 07:04:02+00:00   0.9
2022-07-12 07:04:59+00:00   3.6
2022-07-12 07:05:00+00:00   NaN
2022-07-12 07:06:22+00:00   3.3
2022-07-12 07:09:46+00:00   2.3
2022-07-12 07:10:00+00:00   NaN
2022-07-12 07:11:22+00:00   1.3
2022-07-12 07:13:44+00:00   4.3
2022-07-12 07:14:26+00:00   4.1
2022-07-12 07:15:00+00:00   NaN
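For anyone who wants to run this, the frame above can be rebuilt from the question's data roughly like so (a sketch):

import numpy as np
import pandas as pd

times = [
    "07:00:00", "07:01:53", "07:03:17", "07:04:02", "07:04:59",
    "07:05:00", "07:06:22", "07:09:46", "07:10:00", "07:11:22",
    "07:13:44", "07:14:26", "07:15:00",
]
values = [np.nan, 1.5, 2.2, 0.9, 3.6, np.nan, 3.3, 2.3, np.nan, 1.3, 4.3, 4.1, np.nan]

df = pd.DataFrame(
    {"var1": values},
    index=pd.to_datetime([f"2022-07-12 {t}+00:00" for t in times]),
)
df.index.name = "ts"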

Doing:

# per-5-minute-bin min/max, broadcast back onto every original row
df['var1_min'] = df.resample('5min')['var1'].transform('min')
df['var1_max'] = df.resample('5min')['var1'].transform('max')
# reindex onto a regular 1-minute grid, forward-filling from the last original row
df = df.asfreq('1min', method='ffill')

Output:

                           var1  var1_min  var1_max
2022-07-12 07:00:00+00:00   NaN       0.9       3.6
2022-07-12 07:01:00+00:00   NaN       0.9       3.6
2022-07-12 07:02:00+00:00   1.5       0.9       3.6
2022-07-12 07:03:00+00:00   1.5       0.9       3.6
2022-07-12 07:04:00+00:00   2.2       0.9       3.6
2022-07-12 07:05:00+00:00   NaN       2.3       3.3
2022-07-12 07:06:00+00:00   NaN       2.3       3.3
2022-07-12 07:07:00+00:00   3.3       2.3       3.3
2022-07-12 07:08:00+00:00   3.3       2.3       3.3
2022-07-12 07:09:00+00:00   3.3       2.3       3.3
2022-07-12 07:10:00+00:00   NaN       1.3       4.3
2022-07-12 07:11:00+00:00   NaN       1.3       4.3
2022-07-12 07:12:00+00:00   1.3       1.3       4.3
2022-07-12 07:13:00+00:00   1.3       1.3       4.3
2022-07-12 07:14:00+00:00   4.3       1.3       4.3
2022-07-12 07:15:00+00:00   NaN       NaN       NaN

Upvotes: 2
