Reputation: 6312
I'm trying to aggregate a Pandas DataFrame by date, and then index into the result by date, but I can't seem to index it at all.
from datetime import date, datetime
import pandas as pd
import numpy as np
ts = pd.DataFrame({'data': np.random.randn(60)}, index=pd.date_range('1/1/2000', periods=60, freq='2h'))
g = ts.groupby(ts.index.date).agg(np.mean)
print(g)
data
2000-01-01 0.090038
2000-01-02 0.099970
2000-01-03 -0.619274
2000-01-04 0.027040
2000-01-05 -0.323205
First, I would expect the index to be a DatetimeIndex
, but:
print(g.index)
Index([2000-01-01, 2000-01-02, 2000-01-03, 2000-01-04, 2000-01-05], dtype='object')
g.index[0]
datetime.date(2000, 1, 1)
OK, let's try to index it:
g['2000-01-01']
KeyError: u'no item named 2000-01-01'
g[date(2000, 01, 01)]
KeyError: u'no item named 2000-01-01'
g[datetime(2000, 01, 01)]
KeyError: u'no item named 2000-01-01 00:00:00'
g[pd.to_datetime('2000-01-01')]
KeyError: u'no item named 2000-01-01 00:00:00'
I can't even index it with the index itself!
g[g.index[0]]
KeyError: u'no item named 2000-01-01'
What am I missing? Why isn't the result of grouping/aggregating a DatetimeIndex
a DatetimeIndex
, or at least a PeriodIndex
? Is there a way to make it one?
Upvotes: 2
Views: 1956
Reputation: 375485
The index is a datetime because you are passing groupby an array of datetimes:
In [11]: ts.index.date[:2]
Out[11]: array([datetime.date(2000, 1, 1), datetime.date(2000, 1, 1)], dtype=object)
a more usual way to do this would be to resample:
In [12]: res = ts.resample('D', how='mean')
In [13]: res
Out[13]:
data
2000-01-01 0.181246
2000-01-02 -0.167023
2000-01-03 -0.075843
2000-01-04 -0.218141
2000-01-05 -0.144635
or if you are doing something more complicated, you can use a TimeGrouper:
In [14]: ts.groupby(pd.TimeGrouper('D')).mean()
Out[14]:
data
2000-01-01 0.181246
2000-01-02 -0.167023
2000-01-03 -0.075843
2000-01-04 -0.218141
2000-01-05 -0.144635
these have index which is a DatetimeIndex.
With regards to accessing the rows, you should use loc, though I like to access with a Timestamp:
In [15]: res.loc['2000-01-01'] # KeyError without using loc
Out[15]:
data 0.181246
Name: 2000-01-01 00:00:00, dtype: float64
In [16]: res.loc[pd.Timestamp('2000-01-01')]
Out[16]:
data 0.181246
Name: 2000-01-01 00:00:00, dtype: float64
You can slice using these strings (but I don't think it's a super robust idea, I prefer to be use the Timestamps):
In [17]: res['2000-01-01':'2000-01-01']
Out[17]:
data
2000-01-01 0.181246
Upvotes: 2