Reputation: 775
Given the below pandas DataFrame:
In [115]: times = pd.to_datetime(pd.Series(['2014-08-25 21:00:00', '2014-08-25 21:04:00',
                                            '2014-08-25 22:07:00', '2014-08-25 22:09:00']))
locations = ['HK', 'LDN', 'LDN', 'LDN']
event = ['foo', 'bar', 'baz', 'qux']
df = pd.DataFrame({'Location': locations,
                   'Event': event}, index=times)
df
Out[115]:
                    Event Location
2014-08-25 21:00:00   foo       HK
2014-08-25 21:04:00   bar      LDN
2014-08-25 22:07:00   baz      LDN
2014-08-25 22:09:00   qux      LDN
I would like to resample the data to aggregate it hourly by count while grouping by location, to produce a data frame that looks like this:
Out[115]:
                     HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
I've tried various combinations of resample() and groupby() but with no luck. How would I go about this?
Upvotes: 75
Views: 79797
Reputation: 4564
pd.resample seems to be many times faster than the equivalent pd.groupby():
df =
                     PJMW_MW
Datetime
2002-04-01 01:00:00   4374.0
2002-04-01 02:00:00   4306.0
2002-04-01 03:00:00   4322.0
2002-04-01 04:00:00   4359.0
2002-04-01 05:00:00   4436.0
...                      ...
2017-12-31 19:00:00   8205.0
2017-12-31 20:00:00   8053.0
2017-12-31 21:00:00   8012.0
2017-12-31 22:00:00   7864.0
2017-12-31 23:00:00   7710.0

[138066 rows × 1 columns]
%timeit df.resample(rule='24H').mean()
3.45 ms ± 41.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby(df.index.strftime('%Y-%m-%d')).mean()
169 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
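Most of that gap is the strftime call, which formats every timestamp as a Python string before grouping. A sketch of a string-free groupby variant (a rough equivalent, not benchmarked here):

# floor('D') truncates each timestamp to midnight, staying in datetime space
# instead of converting every row to a string
df.groupby(df.index.floor('D')).mean()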
Upvotes: 1
Reputation: 2865
df = (df.groupby("Location")
        .resample("H", on="date")
        .last()
        .reset_index(drop=True))
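Note that this takes the last row in each hour rather than a count, and on="date" assumes the timestamps live in a column named date. Adapted to the question's DatetimeIndex'ed frame, a sketch of the same groupby-then-resample pattern for hourly counts:

counts = (df.groupby("Location")
            .resample("H")["Event"]   # the index is already datetime, so no on= is needed
            .count()
            .unstack("Location", fill_value=0))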
Upvotes: 7
Reputation: 424
This can be done without using resample or Grouper as follows:
df.groupby([df.index.floor("1H"), "Location"]).count()
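That returns a long frame indexed by (hour, Location); to get the wide layout asked for in the question, a sketch that selects one column and unstacks:

(df.groupby([df.index.floor("1H"), "Location"])["Event"]
   .count()
   .unstack("Location", fill_value=0))   # missing (hour, location) pairs become 0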
Upvotes: 2
Reputation: 879591
In my original post, I suggested using pd.TimeGrouper. Nowadays, use pd.Grouper instead of pd.TimeGrouper. The syntax is largely the same, but TimeGrouper is now deprecated in favor of pd.Grouper.
Moreover, while pd.TimeGrouper could only group by DatetimeIndex, pd.Grouper can group by datetime columns, which you can specify through the key parameter.
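For example, if the timestamps lived in an ordinary column rather than the index, a sketch (the column name time is invented for illustration):

df2 = df.reset_index().rename(columns={'index': 'time'})   # hypothetical 'time' column
df2.groupby([pd.Grouper(key='time', freq='1H'), 'Location'])['Event'].count()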
You could use a pd.Grouper to group the DatetimeIndex'ed DataFrame by hour:
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
use count to count the number of events in each group:
grouper['Event'].count()
#                      Location
# 2014-08-25 21:00:00  HK          1
#                      LDN         1
# 2014-08-25 22:00:00  LDN         2
# Name: Event, dtype: int64
use unstack to move the Location index level to a column level:
grouper['Event'].count().unstack()
# Out[49]:
# Location             HK  LDN
# 2014-08-25 21:00:00   1    1
# 2014-08-25 22:00:00  NaN    2
and then use fillna to change the NaNs into zeros.
Putting it all together,
grouper = df.groupby([pd.Grouper(freq='1H'), 'Location'])
result = grouper['Event'].count().unstack('Location').fillna(0)
yields
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
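For what it's worth, pd.crosstab can produce the same zero-filled table in one step; a sketch:

# cross-tabulate the floored timestamps against the locations;
# counts default to 0 for absent (hour, location) pairs
pd.crosstab(df.index.floor('1H'), df['Location'])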
Upvotes: 106
Reputation: 4744
unutbu is spot on with his answer, but I wanted to add in what you could do if you had a third column, say Cost, and wanted to aggregate it like above. It was through combining unutbu's answer and this one that I found out how to do this and thought I would share for future users.
Create a DataFrame with a Cost column:
In[1]:
import pandas as pd
import numpy as np
times = pd.to_datetime([
"2014-08-25 21:00:00", "2014-08-25 21:04:00",
"2014-08-25 22:07:00", "2014-08-25 22:09:00"
])
df = pd.DataFrame({
"Location": ["HK", "LDN", "LDN", "LDN"],
"Event": ["foo", "bar", "baz", "qux"],
"Cost": [20, 24, 34, 52]
}, index = times)
df
Out[1]:
                    Location Event  Cost
2014-08-25 21:00:00       HK   foo    20
2014-08-25 21:04:00      LDN   bar    24
2014-08-25 22:07:00      LDN   baz    34
2014-08-25 22:09:00      LDN   qux    52
Now we group as before, using the agg function to specify each column's aggregation method, e.g. count, mean, sum, etc.
In[2]:
grp = df.groupby([pd.Grouper(freq = "1H"), "Location"]) \
.agg({"Event": np.size, "Cost": np.mean})
grp
Out[2]:
                              Event  Cost
                     Location
2014-08-25 21:00:00  HK           1    20
                     LDN          1    24
2014-08-25 22:00:00  LDN          2    43
Then do the final unstack, fill the NaNs with zeros, and display as int because it's nice.
In[3]:
grp.unstack().fillna(0).astype(int)
Out[3]:
                    Event     Cost
Location               HK LDN   HK LDN
2014-08-25 21:00:00     1   1   20  24
2014-08-25 22:00:00     0   2    0  43
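On pandas 0.25 or newer, the same aggregation can be written with named aggregation instead of passing NumPy functions to agg; a sketch:

grp = (df.groupby([pd.Grouper(freq="1H"), "Location"])
         .agg(Event=("Event", "size"),    # count rows per group
              Cost=("Cost", "mean")))     # average cost per group
grp.unstack().fillna(0).astype(int)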
Upvotes: 16
Reputation: 61967
There are two options for doing this. They can actually give different results depending on your data: the first option groups by Location and, within each location, resamples by hour; the second option groups by Location and hour at the same time. The difference matters when a location's data has gaps (see the sketch after the final output below).
Option 1: Use groupby + resample
grouped = df.groupby('Location').resample('H')['Event'].count()
Option 2: Group both the location and DatetimeIndex together with groupby(pd.Grouper)
grouped = df.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()
They both will result in the following:
Location
HK        2014-08-25 21:00:00    1
LDN       2014-08-25 21:00:00    1
          2014-08-25 22:00:00    2
Name: Event, dtype: int64
And then reshape:
grouped.unstack('Location', fill_value=0)
Will output
Location             HK  LDN
2014-08-25 21:00:00   1    1
2014-08-25 22:00:00   0    2
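To see where the two options diverge, give one location a gap in its data: Option 1 resamples within each location, so the empty in-between hours show up with a count of 0, while Option 2 drops empty (location, hour) pairs entirely. A sketch with a hypothetical extra LDN row two hours later:

extra = pd.DataFrame({'Location': ['LDN'], 'Event': ['quux']},    # invented row for illustration
                     index=pd.to_datetime(['2014-08-26 00:30:00']))
df2 = pd.concat([df, extra])

df2.groupby('Location').resample('H')['Event'].count()             # LDN gets a 23:00 bin with count 0
df2.groupby(['Location', pd.Grouper(freq='H')])['Event'].count()   # no (LDN, 23:00) row at all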
Upvotes: 54