Reputation: 39
Regards.
I'm Struggling trying to figure out how to do the next operation in pandas:
I have a csv file with timestamps of stations like the following:
The next thing I do is the following pivot_table using pandas:
trips.pivot_table('bike', aggfunc='count',
index=['date', 'hour'],
columns='station_arrived').fillna(0)
returning something like this:
My Problem is the following:
I want to reindex the 'hour' column to have indexes from 0 to 23 hours per day, even if there aren't counts that day.
Doing reindex with only one index is easy, but things get complicated when I'm trying this in a multiindex dataframe
Is there any way to make that possible?
Upvotes: 2
Views: 903
Reputation: 109520
import datetime as dt
import pandas as pd
from pandas import Timestamp
df = pd.DataFrame(
{'action': ['C', 'C', 'C', 'C', 'C', 'A', 'C'],
'bike': [89, 89, 57, 29, 76, 69, 17],
'cust_id': [6, 6, 30, 30, 30, 30, 30],
'date': [Timestamp('2010-02-02 00:00:00'),
Timestamp('2010-02-02 00:00:00'),
Timestamp('2010-02-05 00:00:00'),
Timestamp('2010-02-05 00:00:00'),
Timestamp('2010-02-05 00:00:00'),
Timestamp('2010-02-05 00:00:00'),
Timestamp('2010-02-05 00:00:00')],
'date_arrived': [Timestamp('2010-02-02 14:27:00'),
Timestamp('2010-02-02 15:42:00'),
Timestamp('2010-02-05 12:06:00'),
Timestamp('2010-02-05 12:07:00'),
Timestamp('2010-02-05 13:11:00'),
Timestamp('2010-02-05 13:14:00'),
Timestamp('2010-02-05 13:45:00')],
'date_removed': [Timestamp('2010-02-02 13:57:00'),
Timestamp('2010-02-02 15:12:00'),
Timestamp('2010-02-05 11:36:00'),
Timestamp('2010-02-05 11:37:00'),
Timestamp('2010-02-05 12:41:00'),
Timestamp('2010-02-05 12:44:00'),
Timestamp('2010-02-05 13:15:00')],
'hour': [14, 15, 12, 12, 13, 13, 13],
'station_arrived': [56, 56, 85, 85, 85, 85, 85],
'station_removed': [56, 56, 85, 85, 85, 85, 85]})
First, create an hourly index spanning your date range:
idx = pd.date_range(df.date.min(), df.date.max() + dt.timedelta(days=1), freq='H')
Now you want to have a datetime index, so set it to 'date_arrived'. Then use groupby
with both TimeGrouper
to group on hours and on station_arrived
. count
the number of non-null station_arrived
values. Unstack the results to get the data in the pivot table format.
Finally, use reindex
to set the index on your new hourly interval idx
index, and fill null values with zero.
>>> (df
.set_index('date_arrived')
.groupby([pd.TimeGrouper('H'), 'station_arrived'])
.station_arrived
.count()
.unstack()
.reindex(idx)
.fillna(0)
)
station_arrived 56 85
2010-02-02 00:00:00 0 0
2010-02-02 01:00:00 0 0
2010-02-02 02:00:00 0 0
2010-02-02 03:00:00 0 0
2010-02-02 04:00:00 0 0
2010-02-02 05:00:00 0 0
2010-02-02 06:00:00 0 0
2010-02-02 07:00:00 0 0
2010-02-02 08:00:00 0 0
2010-02-02 09:00:00 0 0
2010-02-02 10:00:00 0 0
2010-02-02 11:00:00 0 0
2010-02-02 12:00:00 0 0
2010-02-02 13:00:00 0 0
2010-02-02 14:00:00 1 0
2010-02-02 15:00:00 1 0
2010-02-02 16:00:00 0 0
...
Upvotes: 2