Reputation: 1205
I am trying to bin a multi-year time series of dates and float values. I'm trying to aggregate each day in to 15 minute bins. So I group the data set by day and then resample in 15 minute increments on each day.
The results seemed odd so I took a closer look at the behaviour of the resampling. The code below summarizes the kind of results I observed (I run it in repl.it)
aindex = pd.to_datetime([
"2013-04-05 04:15:31",
"2013-04-05 05:15:18",
"2013-04-05 05:15:19",
"2013-04-05 05:15:19",
"2013-04-05 05:17:15",
"2013-04-05 07:06:31",
"2013-04-09 04:15:31",
"2013-04-09 05:15:18",
"2013-04-09 05:15:19",
"2013-04-09 05:15:19",
"2013-04-09 05:17:15",
"2013-04-09 07:06:31",
"2013-04-09 07:21:28",
"2013-04-09 09:18:19",
"2013-04-09 09:19:19",
"2013-04-09 09:21:31"])
a = pd.Series([-4.50e+08,
-4.80e+08,
-6.10e+08,
-5.80e+08,
-5.70e+08,
-5.710e+08,
-4.598432e+08,
-4.814140e+08,
-6.109284e+08,
-5.870819e+08,
-5.759888e+08,
-5.713363e+08,
-5.275122e+07,
-2.853787e+08,
-2.523782e+08,
-4.273267e+08],aindex)
print(a)
print(a.groupby(a.index).apply(lambda x: x))
print(a.resample("15T", base=0).apply(lambda x: x))
print(a.groupby(a.index).resample("15T").apply(lambda x: x))
'groupby' behaves as expected but note that each value of 'x' is type pd.Series. 'resample' also returns type pd.Series but appears to miss values when I display it in repl.it or Jupyter but if you change .apply(lambda x: x) to .apply(lambda x: list(x)) you can see there are actually multiple values. 'groupby'+'resample' almost does what I expected ie. each day has 15 minute bins except the indexing is wrong anywhere a 'resample' returned more than one value.
I'm trying to understand what I'm seeing so I can apply the process with confidence. Is this correct behaviour and if so why?
Note: To clarify a bit more my expectations. If I look at the result of a resample for one day then resample includes empty bins:
2013-04-05 04:15:00 -450000000.0
2013-04-05 04:30:00 NaN
2013-04-05 04:45:00 NaN
2013-04-05 05:00:00 NaN
2013-04-05 05:15:00 -570000000.0
2013-04-05 05:30:00 NaN
2013-04-05 05:45:00 NaN
2013-04-05 06:00:00 NaN
2013-04-05 06:15:00 NaN
2013-04-05 06:30:00 NaN
2013-04-05 06:45:00 NaN
2013-04-05 07:00:00 -571000000.0
2013-04-05 07:15:00 NaN
2013-04-05 07:30:00 NaN
2013-04-05 07:45:00 NaN
2013-04-05 08:00:00 NaN
2013-04-05 08:15:00 NaN
2013-04-05 08:30:00 NaN
2013-04-05 08:45:00 NaN
2013-04-05 09:00:00 NaN
2013-04-05 09:15:00 NaN
2013-04-05 09:30:00 NaN
2013-04-05 09:45:00 NaN
2013-04-05 10:00:00 NaN
But if a groupby is done first I don't get empty bins. Why not?:
...
2013-04-05 04:15:31 2013-04-05 04:15:00 -450000000.0
2013-04-05 05:15:18 2013-04-05 05:15:00 -480000000.0
2013-04-05 05:15:19 2013-04-05 05:15:00 -580000000.0
2013-04-05 05:17:15 2013-04-05 05:15:00 -570000000.0
2013-04-05 07:06:31 2013-04-05 07:00:00 -571000000.0
...
Upvotes: 0
Views: 1322
Reputation: 2598
Resample is a tricky function. The main issue with the resampling is that you need to select which value you want to keep (using pandas.DataFrame.last
or pandas.DataFrame.first
).
So doing:
> a.resample("15T", base=0).last()
2013-04-05 04:15:00 -450000000.0
2013-04-05 04:30:00 NaN
2013-04-05 04:45:00 NaN
2013-04-05 05:00:00 NaN
2013-04-05 05:15:00 -570000000.0
would remove the need of using .apply(lambda x: x)
since it will keep that last element from the sample.
pandas.DataFrame
resample already uses mean
as default.
To have an equivalent with groupby
, you would need to do it is safer to group and then apply the mean so we can interpolate the value for the interval
> a.groupby(a.index).mean().resample("15T", base=0).last()
2013-04-05 04:15:00 -450000000.0
2013-04-05 04:30:00 NaN
2013-04-05 04:45:00 NaN
2013-04-05 05:00:00 NaN
2013-04-05 05:15:00 -570000000.0
I hope understood your question correctly. Let me know if it helps.
Edit
You could try to keep all the indices using:
> a.resample('15T').asfreq()
But you will get: ValueError: cannot reindex from a duplicate axis
.
This is the main issue, that indices in pandas cannot be duplicated. That is why groupby
with a mean works, since it groups the items in groups of one elements and then resample for that group.
One way to accomplish this without groupby
is using Multiindex dataframes:
> a.to_frame().set_index([a.index, a.index.round('15T')])
0
2013-04-05 04:15:31 2013-04-05 04:15:00 -450000000.0
2013-04-05 05:15:18 2013-04-05 05:15:00 -480000000.0
2013-04-05 05:15:19 2013-04-05 05:15:00 -610000000.0
2013-04-05 05:15:00 -580000000.0
2013-04-05 05:17:15 2013-04-05 05:15:00 -570000000.0
2013-04-05 07:06:31 2013-04-05 07:00:00 -571000000.0
2013-04-09 04:15:31 2013-04-09 04:15:00 -459843200.0
2013-04-09 05:15:18 2013-04-09 05:15:00 -481414000.0
2013-04-09 05:15:19 2013-04-09 05:15:00 -610928400.0
2013-04-09 05:15:00 -587081900.0
2013-04-09 05:17:15 2013-04-09 05:15:00 -575988800.0
2013-04-09 07:06:31 2013-04-09 07:00:00 -571336300.0
2013-04-09 07:21:28 2013-04-09 07:15:00 -52751220.0
2013-04-09 09:18:19 2013-04-09 09:15:00 -285378700.0
2013-04-09 09:19:19 2013-04-09 09:15:00 -252378200.0
2013-04-09 09:21:31 2013-04-09 09:15:00 -427326700.0
Or, altering the index order to group by the rounded index:
> a.to_frame().set_index([a.index.round('15T'), a.index])
2013-04-05 04:15:00 2013-04-05 04:15:31 -450000000.0
2013-04-05 05:15:00 2013-04-05 05:15:18 -480000000.0
2013-04-05 05:15:19 -610000000.0
2013-04-05 05:15:19 -580000000.0
2013-04-05 05:17:15 -570000000.0
2013-04-05 07:00:00 2013-04-05 07:06:31 -571000000.0
2013-04-09 04:15:00 2013-04-09 04:15:31 -459843200.0
2013-04-09 05:15:00 2013-04-09 05:15:18 -481414000.0
2013-04-09 05:15:19 -610928400.0
2013-04-09 05:15:19 -587081900.0
2013-04-09 05:17:15 -575988800.0
2013-04-09 07:00:00 2013-04-09 07:06:31 -571336300.0
2013-04-09 07:15:00 2013-04-09 07:21:28 -52751220.0
2013-04-09 09:15:00 2013-04-09 09:18:19 -285378700.0
2013-04-09 09:19:19 -252378200.0
2013-04-09 09:21:31 -427326700.0
Upvotes: 2