Reputation: 303
I have a timeseries dataset (at 1-minute) with some subgroups, and want to resample it to a lower frequency (say, 5-minutes). But within some of these groups, the data does not start at the same timestamp that I am interested in as the starting time for all resampled output for all the groups. Here is a created dataset to explain the situation:
import pandas as pd
rng1 = pd.date_range('2000-01-01', periods=10, freq='T')
ts1 = pd.DataFrame(data={'factor': ['A']*10, 'value': [1]*10, 'dummy':np.arange(10)}, index=rng1)
rng2 = pd.date_range('2000-01-01 00:06', periods=10, freq='T')
ts2 = pd.DataFrame(data={'factor': ['B']*10, 'value': [1]*10, 'dummy':np.arange(10)}, index=rng2)
ts3 = ts1.append(ts2)
ts3.groupby(by='factor').resample('5min').sum()
So, here's how the dataset looks at first (ts3):
dummy factor value
2000-01-01 00:00:00 0 A 1
2000-01-01 00:01:00 1 A 1
2000-01-01 00:02:00 2 A 1
2000-01-01 00:03:00 3 A 1
2000-01-01 00:04:00 4 A 1
2000-01-01 00:05:00 5 A 1
2000-01-01 00:06:00 6 A 1
2000-01-01 00:07:00 7 A 1
2000-01-01 00:08:00 8 A 1
2000-01-01 00:09:00 9 A 1
2000-01-01 00:06:00 0 B 1
2000-01-01 00:07:00 1 B 1
2000-01-01 00:08:00 2 B 1
2000-01-01 00:09:00 3 B 1
2000-01-01 00:10:00 4 B 1
2000-01-01 00:11:00 5 B 1
2000-01-01 00:12:00 6 B 1
2000-01-01 00:13:00 7 B 1
2000-01-01 00:14:00 8 B 1
2000-01-01 00:15:00 9 B 1
And the current output is like this (as you can see for group B, the earliest timestamp in the raw data is '2001-01-01 00:06:00' and hence the first resampled timestamp is '2001-01-01 00:05:00', which is completely understandable; so also the last timestamp in the raw data is '2001-01-01 00:09:00', so the last resampled timestamp for group A is '2001-01-01 00:05:00'):
dummy value
factor
A 2000-01-01 00:00:00 10 5
2000-01-01 00:05:00 35 5
B 2000-01-01 00:05:00 6 4
2000-01-01 00:10:00 30 5
2000-01-01 00:15:00 9 1
My desired output is as follows (because the data itself is missing for factor B group in this case, for the first 6 minutes; and the last two 5-minute intervals when factor B group has data, there is no equivalent data in factor A group):
dummy value
factor
A 2000-01-01 00:00:00 10 5
2000-01-01 00:05:00 35 5
2000-01-01 00:10:00 np.NaN np.NaN
2000-01-01 00:15:00 np.NaN np.NaN
B 2000-01-01 00:00:00 np.NaN np.NaN
2000-01-01 00:05:00 6 4
2000-01-01 00:10:00 30 5
2000-01-01 00:15:00 9 1
I am not seeing any documentation or examples of such usage of resample(), i.e. to specify a start/end date_time. Any suggestions or pointers?
The brute approach I am currently considering is to create a DatetimeIndex for the entire period of interest at 1-minute frequency, and join it with my "groups" to have a complete initial set of rows; merge it with my dataset; and then resample. It seems like an overkill when I am dealing with millions of rows. Thanks.
Upvotes: 0
Views: 131
Reputation: 323226
IIUC, using unstack
and stack
Notice, NaN
is consider as float
, that why you see your dummy
and value
became float
ts3.groupby(by='factor').resample('5min').sum().unstack().stack(dropna=False)
Out[407]:
dummy value
factor
A 2000-01-01 00:00:00 10.0 5.0
2000-01-01 00:05:00 35.0 5.0
2000-01-01 00:10:00 NaN NaN
2000-01-01 00:15:00 NaN NaN
B 2000-01-01 00:00:00 NaN NaN
2000-01-01 00:05:00 6.0 4.0
2000-01-01 00:10:00 30.0 5.0
2000-01-01 00:15:00 9.0 1.0
Upvotes: 1