Pandas timeseries resampling after groupby to include missing intervals when some initial (or final) values are missing

Question

I have a timeseries dataset (at 1-minute) with some subgroups, and want to resample it to a lower frequency (say, 5-minutes). But within some of these groups, the data does not start at the same timestamp that I am interested in as the starting time for all resampled output for all the groups. Here is a created dataset to explain the situation:

import pandas as pd

rng1 = pd.date_range('2000-01-01', periods=10, freq='T')
ts1 = pd.DataFrame(data={'factor': ['A']*10, 'value': [1]*10, 'dummy':np.arange(10)}, index=rng1)
rng2 = pd.date_range('2000-01-01 00:06', periods=10, freq='T')
ts2 = pd.DataFrame(data={'factor': ['B']*10, 'value': [1]*10, 'dummy':np.arange(10)}, index=rng2)
ts3 = ts1.append(ts2)
ts3.groupby(by='factor').resample('5min').sum()

So, here's how the dataset looks at first (ts3):

                     dummy factor  value
2000-01-01 00:00:00      0      A      1
2000-01-01 00:01:00      1      A      1
2000-01-01 00:02:00      2      A      1
2000-01-01 00:03:00      3      A      1
2000-01-01 00:04:00      4      A      1
2000-01-01 00:05:00      5      A      1
2000-01-01 00:06:00      6      A      1
2000-01-01 00:07:00      7      A      1
2000-01-01 00:08:00      8      A      1
2000-01-01 00:09:00      9      A      1
2000-01-01 00:06:00      0      B      1
2000-01-01 00:07:00      1      B      1
2000-01-01 00:08:00      2      B      1
2000-01-01 00:09:00      3      B      1
2000-01-01 00:10:00      4      B      1
2000-01-01 00:11:00      5      B      1
2000-01-01 00:12:00      6      B      1
2000-01-01 00:13:00      7      B      1
2000-01-01 00:14:00      8      B      1
2000-01-01 00:15:00      9      B      1

And the current output is like this (as you can see for group B, the earliest timestamp in the raw data is '2001-01-01 00:06:00' and hence the first resampled timestamp is '2001-01-01 00:05:00', which is completely understandable; so also the last timestamp in the raw data is '2001-01-01 00:09:00', so the last resampled timestamp for group A is '2001-01-01 00:05:00'):

                            dummy  value
factor                                  
A      2000-01-01 00:00:00     10      5
       2000-01-01 00:05:00     35      5
B      2000-01-01 00:05:00      6      4
       2000-01-01 00:10:00     30      5
       2000-01-01 00:15:00      9      1

My desired output is as follows (because the data itself is missing for factor B group in this case, for the first 6 minutes; and the last two 5-minute intervals when factor B group has data, there is no equivalent data in factor A group):

                            dummy  value
factor                                  
A      2000-01-01 00:00:00     10      5
       2000-01-01 00:05:00     35      5
       2000-01-01 00:10:00 np.NaN np.NaN
       2000-01-01 00:15:00 np.NaN np.NaN
B      2000-01-01 00:00:00 np.NaN np.NaN
       2000-01-01 00:05:00      6      4
       2000-01-01 00:10:00     30      5
       2000-01-01 00:15:00      9      1

I am not seeing any documentation or examples of such usage of resample(), i.e. to specify a start/end date_time. Any suggestions or pointers?

The brute approach I am currently considering is to create a DatetimeIndex for the entire period of interest at 1-minute frequency, and join it with my "groups" to have a complete initial set of rows; merge it with my dataset; and then resample. It seems like an overkill when I am dealing with millions of rows. Thanks.

BENY · Accepted Answer

IIUC, using unstack and stack

Notice, NaN is consider as float, that why you see your dummy and value became float

ts3.groupby(by='factor').resample('5min').sum().unstack().stack(dropna=False)
Out[407]: 
                            dummy  value
factor                                  
A      2000-01-01 00:00:00   10.0    5.0
       2000-01-01 00:05:00   35.0    5.0
       2000-01-01 00:10:00    NaN    NaN
       2000-01-01 00:15:00    NaN    NaN
B      2000-01-01 00:00:00    NaN    NaN
       2000-01-01 00:05:00    6.0    4.0
       2000-01-01 00:10:00   30.0    5.0
       2000-01-01 00:15:00    9.0    1.0

Pandas timeseries resampling after groupby to include missing intervals when some initial (or final) values are missing

Answers (1)

Related Questions