ghost
ghost

Reputation: 39

reindex multiindex pandas dataframe

Regards.

I'm Struggling trying to figure out how to do the next operation in pandas:

I have a csv file with timestamps of stations like the following:

head of the file

The next thing I do is the following pivot_table using pandas:

trips.pivot_table('bike', aggfunc='count',
                        index=['date', 'hour'],
                        columns='station_arrived').fillna(0)

returning something like this:

enter image description here

My Problem is the following:

I want to reindex the 'hour' column to have indexes from 0 to 23 hours per day, even if there aren't counts that day.

Doing reindex with only one index is easy, but things get complicated when I'm trying this in a multiindex dataframe

Is there any way to make that possible?

Upvotes: 2

Views: 903

Answers (1)

Alexander
Alexander

Reputation: 109520

import datetime as dt
import pandas as pd
from pandas import Timestamp

df = pd.DataFrame(
    {'action': ['C', 'C', 'C', 'C', 'C', 'A', 'C'],
     'bike': [89, 89, 57, 29, 76, 69, 17],
     'cust_id': [6, 6, 30, 30, 30, 30, 30],
     'date': [Timestamp('2010-02-02 00:00:00'),
              Timestamp('2010-02-02 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00'),
              Timestamp('2010-02-05 00:00:00')],
     'date_arrived': [Timestamp('2010-02-02 14:27:00'),
                      Timestamp('2010-02-02 15:42:00'),
                      Timestamp('2010-02-05 12:06:00'),
                      Timestamp('2010-02-05 12:07:00'),
                      Timestamp('2010-02-05 13:11:00'),
                      Timestamp('2010-02-05 13:14:00'),
                      Timestamp('2010-02-05 13:45:00')],
     'date_removed': [Timestamp('2010-02-02 13:57:00'),
                      Timestamp('2010-02-02 15:12:00'),
                      Timestamp('2010-02-05 11:36:00'),
                      Timestamp('2010-02-05 11:37:00'),
                      Timestamp('2010-02-05 12:41:00'),
                      Timestamp('2010-02-05 12:44:00'),
                      Timestamp('2010-02-05 13:15:00')],
     'hour': [14, 15, 12, 12, 13, 13, 13],
     'station_arrived': [56, 56, 85, 85, 85, 85, 85],
     'station_removed': [56, 56, 85, 85, 85, 85, 85]})

First, create an hourly index spanning your date range:

idx = pd.date_range(df.date.min(), df.date.max() + dt.timedelta(days=1), freq='H')

Now you want to have a datetime index, so set it to 'date_arrived'. Then use groupby with both TimeGrouper to group on hours and on station_arrived. count the number of non-null station_arrived values. Unstack the results to get the data in the pivot table format.

Finally, use reindex to set the index on your new hourly interval idx index, and fill null values with zero.

>>> (df
     .set_index('date_arrived')
     .groupby([pd.TimeGrouper('H'), 'station_arrived'])
     .station_arrived
     .count()
     .unstack()
     .reindex(idx)
     .fillna(0)
     )
station_arrived      56  85
2010-02-02 00:00:00   0   0
2010-02-02 01:00:00   0   0
2010-02-02 02:00:00   0   0
2010-02-02 03:00:00   0   0
2010-02-02 04:00:00   0   0
2010-02-02 05:00:00   0   0
2010-02-02 06:00:00   0   0
2010-02-02 07:00:00   0   0
2010-02-02 08:00:00   0   0
2010-02-02 09:00:00   0   0
2010-02-02 10:00:00   0   0
2010-02-02 11:00:00   0   0
2010-02-02 12:00:00   0   0
2010-02-02 13:00:00   0   0
2010-02-02 14:00:00   1   0
2010-02-02 15:00:00   1   0
2010-02-02 16:00:00   0   0
...

Upvotes: 2

Related Questions