Reputation: 742
I am working on a dataset containing data from taxi rides in the city of Chicago. The data contains fields such as Taxi ID, Timestamp, and Fare, among others. A sample of the data is shown below, with the timestamps stored as pandas datetimes:
For a given timeframe (an hour, a day, or a week), I want to compute the median revenue, i.e. how much a typical taxi makes, and analyze how it changes over the dataset's timespan. For example, to look at the data hourly, I visualize how the median revenue of a taxi changes by hour: I create a pivot table of Taxi ID against Hour and then take the median of the total amount made.
taxiByHour = taxiData.pivot_table(index='Taxi ID', columns='Hour', aggfunc=sum)
taxiByHour.fillna(0, inplace=True)
taxiByHour['Trip Total'].median().plot(kind='bar', x='hour', rot=0, color='green')
This computes the median over all rides in the data that took place during that hour of the day, regardless of date. How can I extend this so that the median value is displayed by hour continuously by date, across the whole timespan of the data? For example, for a week of data, I want to display the median revenue for each of the 24*7 = 168 hours.
Upvotes: 0
Views: 1451
Reputation: 837
If you have the data in rows with columns=["Taxi ID", "Timestamp", "Fare"], convert the timestamp to datetime.datetime format (if it isn't already) and make it the index, i.e. df.index=df["Timestamp"]. Then you can use pd.DataFrame.resample with a rule (in your case 'H') and a built-in method (see https://pandas.pydata.org/pandas-docs/stable/api.html#resampling) or .apply your own. For example:
from datetime import datetime
import pandas as pd
df = pd.DataFrame([[1],[2]],index=[datetime(2017,1,1,1),datetime(2017,1,1,2)])
resampled = df.resample('H').median()
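Applied to the question's per-taxi setup, the same idea could look roughly like the sketch below. It is only an illustration under some assumptions: the sample rows are made up, the column names "Taxi ID", "Timestamp" and "Trip Total" are taken from the question, and taxis with no rides in an hour are counted as zero revenue (mirroring the fillna(0) in the question). It truncates timestamps with the lowercase "h" alias, which recent pandas versions prefer over 'H', rather than calling resample directly, so that each taxi is summed first and the median is then taken across taxis.

```python
import pandas as pd

# Made-up sample rows; column names follow the question.
taxi_data = pd.DataFrame({
    "Taxi ID": [1, 1, 2, 2, 3, 1, 2],
    "Timestamp": pd.to_datetime([
        "2017-01-01 00:05", "2017-01-01 00:40", "2017-01-01 00:10",
        "2017-01-01 01:15", "2017-01-01 01:20", "2017-01-02 00:30",
        "2017-01-02 00:45",
    ]),
    "Trip Total": [10.0, 5.0, 20.0, 8.0, 12.0, 7.0, 9.0],
})

# Truncate every timestamp to the start of its hour, keeping the date,
# so hour 00:00 on Jan 1 and hour 00:00 on Jan 2 stay separate.
hour_start = taxi_data["Timestamp"].dt.floor("h")

# Total revenue per taxi within each dated hour.
per_taxi_hour = taxi_data.groupby(["Taxi ID", hour_start])["Trip Total"].sum()

# Rows: dated hours, columns: taxis. Idle taxis count as 0 (an assumption).
revenue = per_taxi_hour.unstack("Taxi ID", fill_value=0)

# Median across taxis for every dated hour.
median_by_hour = revenue.median(axis=1)
print(median_by_hour)
```

The resulting Series is indexed by dated hour, so median_by_hour.plot(kind='bar') should give one bar per hour across the whole timespan. Hours in which no taxi had any ride do not appear at all in this sketch; reindexing onto a full hourly range would be needed if those gaps should show as zeros.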
Upvotes: 1