HighVoltage

Reputation: 742

Pandas: Compute median of data per hour, group by date

I am working on a dataset of taxi rides in the city of Chicago. It contains fields such as Taxi ID, Timestamp, and Fare, among others. A sample of the data is shown below, with the timestamps stored as pandas datetimes:

[image: sample of the taxi data]

For a given timeframe (an hour, a day, or a week), I want to compute the median amount a taxi makes, i.e. how much a typical taxi earns, and analyze how that changes over the dataset's time span. For example, to look at the data hourly, I visualize how the median revenue of a taxi changes by hour by creating a pivot table of Taxi ID against Hour and then taking the median of the total amount made.

taxiByHour = taxiData.pivot_table(index='Taxi ID', columns='Hour', aggfunc='sum')
taxiByHour.fillna(0, inplace=True)
taxiByHour['Trip Total'].median().plot(kind='bar', rot=0, color='green')

This computes the median over all rides that took place during that hour of the day, pooled across every date in the data. How can I extend this so that the median value per hour is displayed continuously by date over the whole time span of the data? For example, for one week of data I want to display the profile of the median revenue for each of the 24*7 = 168 hours.

Upvotes: 0

Views: 1451

Answers (1)

GratefulGuest

Reputation: 837

If you have the data in rows with columns=["Taxi ID", "Timestamp", "Fare"], convert the timestamp column to datetime (if it isn't already, e.g. with pd.to_datetime) and make it the index, i.e. df.index = df["Timestamp"]. Then you can use pd.DataFrame.resample with a rule (in your case 'H' for hourly; newer pandas releases spell this 'h') and a built-in method (see https://pandas.pydata.org/pandas-docs/stable/api.html#resampling) or .apply your own. For example:

import pandas as pd
from datetime import datetime
df = pd.DataFrame([[1],[2]], index=[datetime(2017,1,1,1), datetime(2017,1,1,2)])
resampled = df.resample('H').median()  # one median per hourly bin
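Note that resampling the fares directly gives the median fare per ride in each hour. If what you want is the median of each taxi's hourly total, as in the pivot table from the question, one possible approach is to sum per taxi per calendar hour first and then take the median across taxis. The following is only a sketch: the sample rows are made up, and the column names "Taxi ID", "Timestamp" and "Trip Total" are assumed from the question.

```python
import pandas as pd

# Made-up sample rows; column names are assumed from the question.
taxi_data = pd.DataFrame({
    "Taxi ID": [1, 1, 2, 2, 3, 1, 2],
    "Timestamp": pd.to_datetime([
        "2017-01-01 00:10", "2017-01-01 00:40",
        "2017-01-01 00:15", "2017-01-01 01:05",
        "2017-01-01 01:30", "2017-01-02 00:20",
        "2017-01-02 00:50",
    ]),
    "Trip Total": [10.0, 5.0, 20.0, 8.0, 12.0, 7.0, 9.0],
})

# Truncate each timestamp to the start of its hour; the date is kept,
# so 00:00 on Jan 1 and 00:00 on Jan 2 stay separate.
hour_start = taxi_data["Timestamp"].dt.floor("h")

# Total revenue per taxi per calendar hour.
per_taxi_hour = taxi_data.groupby(["Taxi ID", hour_start])["Trip Total"].sum()

# Median across taxis for each calendar hour.
median_by_hour = per_taxi_hour.groupby(level="Timestamp").median()

print(median_by_hour)
```

The result is a Series indexed by the start of each hour, so median_by_hour.plot() draws the profile continuously over the whole time span. This version only counts taxis that had at least one ride in a given hour. To count idle taxis as 0, as the fillna(0) in the question does, you could instead use per_taxi_hour.unstack("Taxi ID", fill_value=0).median(axis=1).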

Upvotes: 1
