Reputation: 4276
I have a pandas series as such
(that i've made by doing a groupby a two columns
divvy_df.groupby(['from_station_name', 'starttime']).size()
):
from_station_name starttime
900 W Harrison 2014-07-01 08:00:00 1
2014-07-01 10:00:00 1
2014-07-01 11:00:00 1
2014-07-01 12:00:00 1
2014-07-01 13:00:00 1
2014-07-01 16:00:00 1
2014-07-01 17:00:00 3
2014-07-01 22:00:00 1
2014-07-02 01:00:00 1
2014-07-02 08:00:00 1
2014-07-02 12:00:00 2
...
As you can see, the resulting series shows the number of times that starttime is logged for that station name. I'd like to make it somehow though, so that the missing time slots are created and filled in with a value of 0. Is that possible?
So I'd want something like this:
from_station_name starttime
900 W Harrison 2014-07-01 00:00:00 0
2014-07-01 01:00:00 0
2014-07-01 02:00:00 0
2014-07-01 03:00:00 0
...
2014-07-01 08:00:00 1
2014-07-01 09:00:00 0
2014-07-01 10:00:00 1
2014-07-01 11:00:00 1
2014-07-01 12:00:00 1
2014-07-01 13:00:00 0
...
2014-07-01 22:00:00 1
2014-07-01 23:00:00 0
2014-07-02 00:00:00 0
What's the best way to get something like this done?
Upvotes: 4
Views: 1425
Reputation: 880937
Usually when you have a Series, ts
, with DatetimeIndex, you can expand it to
include new dates by using ts.reindex
, or ts.asfreq
or ts.resample
.
Since ts.reindex
has a fill_value
parameter for filling in missing values, I
think it will be easiest to use ts.reindex
in this case.
However, in this problem, ts
has a MultiIndex with from_station_name
and
starttime
levels.
So the first step is to move the from_station_name
level out of the index.
There are two options. You can use reset_index
to move it to a column, or use
unstack
to move it into a new column index level. I think unstacking
with
ts = ts.unstack(level=0)
is the better choice here because with the other option (using reset_index
)
you run into issues related to duplicate dates in the index, and the problem of
back-filling and forward-filling missing from_station_name
values.
Now the Series has a single level DatetimeIndex, so you can reindex
like this:
index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)
And now undo the unstack
operation, thus moving the column index level back
into the row index:
ts = ts.stack()
ts = ts.swaplevel(0,1)
import pandas as pd
ts = pd.Series({('900 W Harrison', '2014-07-01 08:00:00'): 1,
('900 W Harrison', '2014-07-01 10:00:00'): 1,
('900 W Harrison', '2014-07-01 11:00:00'): 1,
('900 W Harrison', '2014-07-01 12:00:00'): 1,
('900 W Harrison', '2014-07-01 13:00:00'): 1,
('900 W Harrison', '2014-07-01 16:00:00'): 1,
('900 W Harrison', '2014-07-01 17:00:00'): 3,
('900 W Harrison', '2014-07-01 22:00:00'): 1,
('900 W Harrison', '2014-07-02 01:00:00'): 1,
('900 W Harrison', '2014-07-02 08:00:00'): 1,
('900 W Harrison', '2014-07-02 12:00:00'): 2})
ts = ts.unstack(level=0)
# ensure ts.index is a DatetimeIndex
ts.index = pd.to_datetime(ts.index)
index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)
ts = ts.stack()
ts = ts.swaplevel(0,1)
ts = ts.sortlevel()
print(ts)
yields
900 W Harrison 2014-07-01 00:00:00 0
2014-07-01 01:00:00 0
2014-07-01 02:00:00 0
2014-07-01 03:00:00 0
2014-07-01 04:00:00 0
2014-07-01 05:00:00 0
2014-07-01 06:00:00 0
2014-07-01 07:00:00 0
2014-07-01 08:00:00 1
2014-07-01 09:00:00 0
2014-07-01 10:00:00 1
...
2014-07-02 21:00:00 0
2014-07-02 22:00:00 0
2014-07-02 23:00:00 0
2014-07-03 00:00:00 0
dtype: int64
Upvotes: 5