SpicyClubSauce
SpicyClubSauce

Reputation: 4276

Making missing time slots and filling in 0 values in a Pandas series

I have a pandas series as such (that i've made by doing a groupby a two columns divvy_df.groupby(['from_station_name', 'starttime']).size() ):

from_station_name       starttime          
900 W Harrison          2014-07-01 08:00:00    1
                        2014-07-01 10:00:00    1
                        2014-07-01 11:00:00    1
                        2014-07-01 12:00:00    1
                        2014-07-01 13:00:00    1
                        2014-07-01 16:00:00    1
                        2014-07-01 17:00:00    3
                        2014-07-01 22:00:00    1
                        2014-07-02 01:00:00    1
                        2014-07-02 08:00:00    1
                        2014-07-02 12:00:00    2
                        ...

As you can see, the resulting series shows the number of times that starttime is logged for that station name. I'd like to make it somehow though, so that the missing time slots are created and filled in with a value of 0. Is that possible?

So I'd want something like this:

from_station_name       starttime          
    900 W Harrison          2014-07-01 00:00:00    0
                            2014-07-01 01:00:00    0
                            2014-07-01 02:00:00    0
                            2014-07-01 03:00:00    0
                            ...
                            2014-07-01 08:00:00    1
                            2014-07-01 09:00:00    0   
                            2014-07-01 10:00:00    1
                            2014-07-01 11:00:00    1
                            2014-07-01 12:00:00    1
                            2014-07-01 13:00:00    0
                            ...
                            2014-07-01 22:00:00    1
                            2014-07-01 23:00:00    0
                            2014-07-02 00:00:00    0

What's the best way to get something like this done?

Upvotes: 4

Views: 1425

Answers (1)

unutbu
unutbu

Reputation: 880937

Usually when you have a Series, ts, with DatetimeIndex, you can expand it to include new dates by using ts.reindex, or ts.asfreq or ts.resample. Since ts.reindex has a fill_value parameter for filling in missing values, I think it will be easiest to use ts.reindex in this case.

However, in this problem, ts has a MultiIndex with from_station_name and starttime levels.

So the first step is to move the from_station_name level out of the index. There are two options. You can use reset_index to move it to a column, or use unstack to move it into a new column index level. I think unstacking with

ts = ts.unstack(level=0)

is the better choice here because with the other option (using reset_index) you run into issues related to duplicate dates in the index, and the problem of back-filling and forward-filling missing from_station_name values.

Now the Series has a single level DatetimeIndex, so you can reindex like this:

index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)

And now undo the unstack operation, thus moving the column index level back into the row index:

ts = ts.stack()
ts = ts.swaplevel(0,1)

import pandas as pd
ts = pd.Series({('900 W Harrison', '2014-07-01 08:00:00'): 1,
 ('900 W Harrison', '2014-07-01 10:00:00'): 1,
 ('900 W Harrison', '2014-07-01 11:00:00'): 1,
 ('900 W Harrison', '2014-07-01 12:00:00'): 1,
 ('900 W Harrison', '2014-07-01 13:00:00'): 1,
 ('900 W Harrison', '2014-07-01 16:00:00'): 1,
 ('900 W Harrison', '2014-07-01 17:00:00'): 3,
 ('900 W Harrison', '2014-07-01 22:00:00'): 1,
 ('900 W Harrison', '2014-07-02 01:00:00'): 1,
 ('900 W Harrison', '2014-07-02 08:00:00'): 1,
 ('900 W Harrison', '2014-07-02 12:00:00'): 2})

ts = ts.unstack(level=0)
# ensure ts.index is a DatetimeIndex
ts.index = pd.to_datetime(ts.index)
index = pd.date_range('2014-07-01', '2014-07-03', freq='H')
ts = ts.reindex(index, fill_value=0)
ts = ts.stack()
ts = ts.swaplevel(0,1)
ts = ts.sortlevel()
print(ts)

yields

900 W Harrison  2014-07-01 00:00:00    0
                2014-07-01 01:00:00    0
                2014-07-01 02:00:00    0
                2014-07-01 03:00:00    0
                2014-07-01 04:00:00    0
                2014-07-01 05:00:00    0
                2014-07-01 06:00:00    0
                2014-07-01 07:00:00    0
                2014-07-01 08:00:00    1
                2014-07-01 09:00:00    0
                2014-07-01 10:00:00    1
                ...
                2014-07-02 21:00:00    0
                2014-07-02 22:00:00    0
                2014-07-02 23:00:00    0
                2014-07-03 00:00:00    0
dtype: int64

Upvotes: 5

Related Questions