Pandas resample() Series giving incorrect indexes

Question

I am trying to bin a multi-year time series of dates and float values. I'm trying to aggregate each day in to 15 minute bins. So I group the data set by day and then resample in 15 minute increments on each day.

The results seemed odd so I took a closer look at the behaviour of the resampling. The code below summarizes the kind of results I observed (I run it in repl.it)

aindex = pd.to_datetime([
"2013-04-05 04:15:31",
"2013-04-05 05:15:18",
"2013-04-05 05:15:19",
"2013-04-05 05:15:19",
"2013-04-05 05:17:15",
"2013-04-05 07:06:31",
"2013-04-09 04:15:31",
"2013-04-09 05:15:18",
"2013-04-09 05:15:19",
"2013-04-09 05:15:19",
"2013-04-09 05:17:15",
"2013-04-09 07:06:31",
"2013-04-09 07:21:28",
"2013-04-09 09:18:19",
"2013-04-09 09:19:19",
"2013-04-09 09:21:31"])
a = pd.Series([-4.50e+08,
-4.80e+08,
-6.10e+08,
-5.80e+08,
-5.70e+08,
-5.710e+08,
-4.598432e+08,
-4.814140e+08,
-6.109284e+08,
-5.870819e+08,
-5.759888e+08,
-5.713363e+08,
-5.275122e+07,
-2.853787e+08,
-2.523782e+08,
-4.273267e+08],aindex)
print(a)
print(a.groupby(a.index).apply(lambda x: x))
print(a.resample("15T", base=0).apply(lambda x: x))
print(a.groupby(a.index).resample("15T").apply(lambda x: x))

'groupby' behaves as expected but note that each value of 'x' is type pd.Series. 'resample' also returns type pd.Series but appears to miss values when I display it in repl.it or Jupyter but if you change .apply(lambda x: x) to .apply(lambda x: list(x)) you can see there are actually multiple values. 'groupby'+'resample' almost does what I expected ie. each day has 15 minute bins except the indexing is wrong anywhere a 'resample' returned more than one value.

I'm trying to understand what I'm seeing so I can apply the process with confidence. Is this correct behaviour and if so why?

Note: To clarify a bit more my expectations. If I look at the result of a resample for one day then resample includes empty bins:

2013-04-05 04:15:00   -450000000.0
2013-04-05 04:30:00            NaN
2013-04-05 04:45:00            NaN
2013-04-05 05:00:00            NaN
2013-04-05 05:15:00   -570000000.0
2013-04-05 05:30:00            NaN
2013-04-05 05:45:00            NaN
2013-04-05 06:00:00            NaN
2013-04-05 06:15:00            NaN
2013-04-05 06:30:00            NaN
2013-04-05 06:45:00            NaN
2013-04-05 07:00:00   -571000000.0
2013-04-05 07:15:00            NaN
2013-04-05 07:30:00            NaN
2013-04-05 07:45:00            NaN
2013-04-05 08:00:00            NaN
2013-04-05 08:15:00            NaN
2013-04-05 08:30:00            NaN
2013-04-05 08:45:00            NaN
2013-04-05 09:00:00            NaN
2013-04-05 09:15:00            NaN
2013-04-05 09:30:00            NaN
2013-04-05 09:45:00            NaN
2013-04-05 10:00:00            NaN

But if a groupby is done first I don't get empty bins. Why not?:

...
2013-04-05 04:15:31  2013-04-05 04:15:00   -450000000.0
2013-04-05 05:15:18  2013-04-05 05:15:00   -480000000.0
2013-04-05 05:15:19  2013-04-05 05:15:00   -580000000.0
2013-04-05 05:17:15  2013-04-05 05:15:00   -570000000.0
2013-04-05 07:06:31  2013-04-05 07:00:00   -571000000.0
...

Mabel Villalba · Accepted Answer

Resample is a tricky function. The main issue with the resampling is that you need to select which value you want to keep (using pandas.DataFrame.last or pandas.DataFrame.first).

So doing:

> a.resample("15T", base=0).last()

2013-04-05 04:15:00   -450000000.0
2013-04-05 04:30:00            NaN
2013-04-05 04:45:00            NaN
2013-04-05 05:00:00            NaN
2013-04-05 05:15:00   -570000000.0

would remove the need of using .apply(lambda x: x) since it will keep that last element from the sample.

pandas.DataFrame resample already uses mean as default.

To have an equivalent with groupby, you would need to do it is safer to group and then apply the mean so we can interpolate the value for the interval

> a.groupby(a.index).mean().resample("15T", base=0).last()


2013-04-05 04:15:00   -450000000.0
2013-04-05 04:30:00            NaN
2013-04-05 04:45:00            NaN
2013-04-05 05:00:00            NaN
2013-04-05 05:15:00   -570000000.0

I hope understood your question correctly. Let me know if it helps.

Edit

You could try to keep all the indices using:

> a.resample('15T').asfreq()

But you will get: ValueError: cannot reindex from a duplicate axis.

This is the main issue, that indices in pandas cannot be duplicated. That is why groupby with a mean works, since it groups the items in groups of one elements and then resample for that group.

One way to accomplish this without groupbyis using Multiindex dataframes:

> a.to_frame().set_index([a.index, a.index.round('15T')])

                                                   0
2013-04-05 04:15:31 2013-04-05 04:15:00 -450000000.0
2013-04-05 05:15:18 2013-04-05 05:15:00 -480000000.0
2013-04-05 05:15:19 2013-04-05 05:15:00 -610000000.0
                    2013-04-05 05:15:00 -580000000.0
2013-04-05 05:17:15 2013-04-05 05:15:00 -570000000.0
2013-04-05 07:06:31 2013-04-05 07:00:00 -571000000.0
2013-04-09 04:15:31 2013-04-09 04:15:00 -459843200.0
2013-04-09 05:15:18 2013-04-09 05:15:00 -481414000.0
2013-04-09 05:15:19 2013-04-09 05:15:00 -610928400.0
                    2013-04-09 05:15:00 -587081900.0
2013-04-09 05:17:15 2013-04-09 05:15:00 -575988800.0
2013-04-09 07:06:31 2013-04-09 07:00:00 -571336300.0
2013-04-09 07:21:28 2013-04-09 07:15:00  -52751220.0
2013-04-09 09:18:19 2013-04-09 09:15:00 -285378700.0
2013-04-09 09:19:19 2013-04-09 09:15:00 -252378200.0
2013-04-09 09:21:31 2013-04-09 09:15:00 -427326700.0

Or, altering the index order to group by the rounded index:

> a.to_frame().set_index([a.index.round('15T'), a.index])


2013-04-05 04:15:00 2013-04-05 04:15:31 -450000000.0
2013-04-05 05:15:00 2013-04-05 05:15:18 -480000000.0
                    2013-04-05 05:15:19 -610000000.0
                    2013-04-05 05:15:19 -580000000.0
                    2013-04-05 05:17:15 -570000000.0
2013-04-05 07:00:00 2013-04-05 07:06:31 -571000000.0
2013-04-09 04:15:00 2013-04-09 04:15:31 -459843200.0
2013-04-09 05:15:00 2013-04-09 05:15:18 -481414000.0
                    2013-04-09 05:15:19 -610928400.0
                    2013-04-09 05:15:19 -587081900.0
                    2013-04-09 05:17:15 -575988800.0
2013-04-09 07:00:00 2013-04-09 07:06:31 -571336300.0
2013-04-09 07:15:00 2013-04-09 07:21:28  -52751220.0
2013-04-09 09:15:00 2013-04-09 09:18:19 -285378700.0
                    2013-04-09 09:19:19 -252378200.0
                    2013-04-09 09:21:31 -427326700.0

Pandas resample() Series giving incorrect indexes

Answers (1)

Related Questions