Reputation: 340
I have a pandas dataframe containing ~20,xxx records of bus boarding data. The dataset contains a cardNumber
field which is unique for each passenger. There is a type
field that identifies the type of boarding. There is a routeName
column specifying which route the boarding took place on and finally a Date
column which identifies when the boarding took place. I have provided a mock dataframe below.
df = pd.DataFrame(
{'cardNumber': ['999', '999', '999', '999', '901', '901', '888', '888'],
'type': ['trip_pass', 'transfer', 'trip_pass', 'transfer', 'stored_value', 'transfer', 'trip_pass',
'trip_pass'],
'routeName': ['1', '2', '2', '1', '20', '3', '4', '4'],
'Date': ['2020-08-01 06:18:56 -04:00', '2020-08-01 06:46:12 -04:00', '2020-08-01 17:13:51 -04:00',
'2020-08-01 17:47:32 -04:00', '2020-08-10 15:23:16 -04:00', '2020-08-10 15:44:45 -04:00',
'2020-08-31 06:54:09 -04:00', '2020-08-31 16:23:41 -04:00']}
)
df['Date'] = pd.to_datetime(df['Date'])
What I would like to do is summarize the transfer activity. How many transfers, on average, occurred from Route 1 to Route 2 or from Route 2 to Route 1. There are eleven different routes in the dataset that transfers could happen between.
I would like the output to look something like (note that the output below was not generated from the sample provided above):
From | To | Avg. Daily
----------------------------------
1 | 2 | 45.7
1 | 3 | 22.6
20 | 1 | 12.2
Upvotes: 2
Views: 713
Reputation: 4929
The following code works in the chunk data you provided. If it doesn't work in your actual data, pls let me know. There's probably better ways to do that, but I think it's a good starting point.
The general idea here is to group by passengers to figure out the route. Then, since you want the daily average, you gonna need to group by date, and then by destinations in order to calculate the daily mean.
# Define a function to get routes' relationship (origin vs destination)
def get_routes(x):
if 'transfer' not in x.type.tolist(): # if no 'transfer' type in group, leave it as 0 (we'll remove them afterwards)
return 0
x = x[x.type == 'transfer'] # select target type
date = df[df.cardNumber=='999'].Date.dt.strftime('%m/%d/%Y').unique()
if date.size == 1: # if there is more than one date by passenger, you'll need to change this code
date = date[0]
else:
raise Exception("There are more than one date per passenger, please adapt your code.")
s_from = x.routeName[x.Date.idxmin()] # get route from the first date
s_to = x.routeName[x.Date.idxmax()] # get route from the last date
return date, s_from, s_to
# Define a function to get the routes' daily average
def get_daily_avg(date_group):
daily_avg = (
date_group.groupby(['From', 'To'], as_index=False) # group the day by routes
.apply(lambda route: route.shape[0] / date_group.shape[0]) # divide the total of trips of that route by the total trips of that day
)
return daily_avg
# Get route's relationship
routes_series = df.groupby('cardNumber').apply(get_routes) # retrive routes per passenger
routes_series = routes_series[routes_series!=0] # remove groups without the target type
# Create a named dataframe from the series output
routes_df = pd.DataFrame(routes_series.tolist(), columns=['Date', 'From', 'To'])
# Create dataframe, perform filter and calculations
daily_routes_df = (
routes_df.query('From != To') # remove routes with same destination as the origin
.groupby('Date').apply(get_daily_avg) # calculate the mean per date
.rename(columns={None: 'Avg. Daily'}) # set name to previous output
.drop(['From','To'], axis = 1) # drop out redundant info since there's such info at the index
.reset_index() # remove MultiIndex to get a tidy dataframe
)
# Visualize results
print(daily_routes_df)
Output:
Date From To Avg. Daily
0 08/01/2020 2 1 1.0
Here, the average is 1 since there's only one count per group. Note that only the "transfer" type has been taking into account. The ones without it, or with no changing route, were further removed.
Upvotes: 1
Reputation: 4521
If I got the question right, you would like to get the trip start and end out of your events and the first event corresponds with the starting point (the route name) and then calculate the number of tickets you have in your data set that have the same starting and endpoint.
If so, you can do this as follows
# srot the dataframe so you can use first/last
df_sorted= df.sort_values(['cardNumber', 'Date']).reset_index(drop=True)
# calculate the counts do the counts, but only
# from the defined types
indexer_trip_points= df_sorted['type'].isin(['transfer'])
df_from_to= df_sorted[indexer_trip_points].groupby('cardNumber').agg(
start_date=('Date', 'first'),
trip_start=('routeName', 'first'),
trip_end=('routeName', 'last'),
)
df_from_to['start_date']= df_from_to['start_date'].dt.date
df_counts= df_from_to.groupby(['trip_start', 'trip_end', 'start_date']).agg(
count=('trip_start', 'count')
)
df_counts.reset_index(drop=False, inplace=True)
df_counts.groupby(['trip_start', 'trip_end']).agg(
avg=('count', 'mean')
)
This results in:
avg
trip_start trip_end
2 1 1
3 3 1
As you can notice, the last entry has the same start- as endpoint. So probably you need to filter out trips, for which you don't have complete data yet. E.g. if in your case a route can never end with the same routeName as it started, you could simply filter them by comparing the two columns.
Upvotes: 0