Reputation: 31
This is my first question on StackOverflow so I've tried to be as clear and concise as possible. Many thanks for your patience in advance.
Background
I have a dataset of train data with 17 attributes, namely: origin_station_code, origin_station, destination_station_code, destination_station, route_code, start_time, end_time, fleet_number, station_code, station, station_type, platform, sch_arr_time, sch_dep_time, act_arr_time, act_dep_time, and date.
Of these attributes, I am only concerned with: date, origin_station, destination_station, and start_time.
This dataset consists of 61 individual CSV files that were combined into one DataFrame of just over a million rows using the glob function and a loop.
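For reference, the combining step can be sketched as follows (the file names and folder here are made up; two tiny CSVs stand in for the 61 real ones, and a single pd.concat over the globbed list avoids appending inside the loop):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSV files to stand in for the 61 real ones.
tmp = tempfile.mkdtemp()
pd.DataFrame({"station": ["A", "B"]}).to_csv(os.path.join(tmp, "day1.csv"), index=False)
pd.DataFrame({"station": ["C"]}).to_csv(os.path.join(tmp, "day2.csv"), index=False)

# The combining step: glob the files, read each one, concatenate once.
csv_files = sorted(glob.glob(os.path.join(tmp, "*.csv")))
df = pd.concat((pd.read_csv(f) for f in csv_files), ignore_index=True)
print(len(df))  # 3
```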
Each row of the DataFrame represents an individual stop of the train journey. A full route is made of several stops, an example of a route consisting of 19 stops, Sugar Wave to Attempt Pin, is shown in the following screenshot: here.
A new attribute called complete_route_name has been created by concatenating the origin_station and destination_station attributes. This identifies every route, of which there are 81 unique entries.
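For context, that derived attribute can be built with plain string concatenation (the " to " separator is my assumption, based on the route names in the sample output):

```python
import pandas as pd

df = pd.DataFrame({
    "origin_station": ["Sugar Wave", "Attempt Pin"],
    "destination_station": ["Attempt Pin", "Roll Test"],
})

# Concatenate origin and destination into a single route identifier.
df["complete_route_name"] = (
    df["origin_station"] + " to " + df["destination_station"]
)
print(df["complete_route_name"].tolist())
# ['Sugar Wave to Attempt Pin', 'Attempt Pin to Roll Test']
```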
The task
My task is to subset the DataFrame using pandas such that it shows the 3 most popular routes per date. This subset DataFrame should show the date, the complete_route_name, and a count of the number of times that route has taken place each day. The number of unique times a route has taken place can be determined by applying the nunique method to the start_time attribute (a date/time type).
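As a quick illustration of why nunique works here: every stop row of the same journey shares one start_time, so counting distinct start times collapses the stop rows back down to a count of journeys.

```python
import pandas as pd

# Four stop rows, but only three distinct start times (= three journeys).
s = pd.Series(["08:00", "08:00", "09:30", "11:15"])
print(s.nunique())  # 3
```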
My current progress
Currently, my GroupBy and Aggregate code is able to show how many times each route ran per day, as follows:
df_grouped = df.groupby(
    ['date', 'complete_route_name']
).agg(
    {
        'start_time': 'nunique'  # count journeys by counting unique start_times
    }
).reset_index()
However, I now want to extend my existing code so that it only shows the top 3 routes by count, per day, e.g.
date        complete_route_name                            count
2015-08-01  Attempt Pin to Roll Test                       101
            Suit Treatment Turnback to Spiders Toothbrush   93
            Concourse Village to Port Morris                87
2015-08-02  Bridge Bottle to Ants Attempt                  119
            North Riverdale to Eastchester                 117
            Wakefield to Kingsbridge                       101
......
2015-09-30  Castleton Corners to Dongan Hills              121
            Eltingville to Graniteville                    119
            Great Kills to Castleton                       117
Any help with this would be greatly appreciated!
Additional resources
The original dataset and my workbook in its current state can be found hosted on my GitHub if that is of any use/interest. A static workbook can also be viewed here.
Many thanks!
Upvotes: 2
Views: 1228
Reputation: 124
df_sorted_by_group = df_grouped.groupby(['date']).apply(
    lambda x: x.sort_values(['start_time'], ascending=False)
).reset_index(drop=True)
df_grouped_top3 = df_sorted_by_group.groupby(['date']).head(3)
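A quick sanity check of this sort-then-head pattern on a toy frame (column names here are made up; the idea is: sort each date's rows by count descending, then keep the first 3 per date):

```python
import pandas as pd

df_grouped = pd.DataFrame({
    "Date": ["d1", "d1", "d1", "d1", "d2", "d2"],
    "Count": [5, 9, 7, 1, 4, 8],
})

# Sort each date's rows by Count descending, then keep the top 3 per date.
df_sorted_by_group = df_grouped.groupby(["Date"]).apply(
    lambda x: x.sort_values(["Count"], ascending=False)
).reset_index(drop=True)
top3 = df_sorted_by_group.groupby(["Date"]).head(3)
print(top3["Count"].tolist())  # [9, 7, 5, 8, 4]
```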
Upvotes: 1
Reputation: 4681
I will continue from where you left off:
df_agg = df.groupby(['date', 'route_name']).agg({'start_time':'nunique'}).reset_index()
Then I would do the following to get what you asked for:
df_sorted_by_group = df_agg.groupby(['date']).apply(
lambda x: x.sort_values(['start_time'],ascending = False)
).reset_index(drop = True)
Final step
df_final = df_sorted_by_group.groupby(['date']).head(3)
Example code
import pandas as pd
routes = {'route_name': [ 'A to B', 'A to B', 'B to C', 'B to C', 'C to D', 'C to D', 'C to D', 'C to D', 'D to E',
'A to Z', 'A to Z', 'B to Z', 'B to Z', 'C to Z', 'C to Z', 'C to Z', 'C to Z', 'D to Z'],
'date': ['01/01/2015','01/01/2015','01/01/2015','01/01/2015','01/01/2015','01/01/2015','01/01/2015','01/01/2015','01/01/2015',
'02/01/2015','02/01/2015','02/01/2015','02/01/2015','02/01/2015','02/01/2015','02/01/2015','02/01/2015','02/01/2015'],
'start_time': ['A1','A2','A3','A4','A5','A6','A7','A8','A9','A10','A11','A12','A13','A14','A15','A16','A17','A18']
}
df = pd.DataFrame(routes)
df['date'] = pd.to_datetime(df['date'],format ='%d/%m/%Y')
df
route_name date start_time
0 A to B 2015-01-01 A1
1 A to B 2015-01-01 A2
2 B to C 2015-01-01 A3
3 B to C 2015-01-01 A4
4 C to D 2015-01-01 A5
5 C to D 2015-01-01 A6
6 C to D 2015-01-01 A7
7 C to D 2015-01-01 A8
8 D to E 2015-01-01 A9
9 A to Z 2015-01-02 A10
10 A to Z 2015-01-02 A11
11 B to Z 2015-01-02 A12
12 B to Z 2015-01-02 A13
13 C to Z 2015-01-02 A14
14 C to Z 2015-01-02 A15
15 C to Z 2015-01-02 A16
16 C to Z 2015-01-02 A17
17 D to Z 2015-01-02 A18
After applying the script above, you get the following results:
df_final
date route_name start_time
0 2015-01-01 C to D 4
1 2015-01-01 A to B 2
2 2015-01-01 B to C 2
4 2015-01-02 C to Z 4
5 2015-01-02 A to Z 2
6 2015-01-02 B to Z 2
Upvotes: 2
Reputation: 13387
Ok, so starting with your working part, I would rewrite it to:
df_grouped = df.groupby(
['date', 'complete_route_name'], as_index=False
)['start_time'].nunique()
Next IIUC you can do:
df2 = df_grouped.groupby("date")["start_time"].rank(ascending=False, method="first")
df_grouped.loc[df2.le(3)]
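Checked on a small frame: ranking within each date with ascending=False makes rank 1 the largest count, and method="first" breaks ties so exactly 3 rows per date survive the le(3) mask.

```python
import pandas as pd

df_grouped = pd.DataFrame({
    "date": ["d1"] * 4 + ["d2"] * 4,
    "complete_route_name": list("ABCDEFGH"),
    "start_time": [4, 2, 7, 5, 1, 9, 3, 6],
})

# Rank counts within each date, largest first; keep ranks 1 through 3.
ranks = df_grouped.groupby("date")["start_time"].rank(
    ascending=False, method="first"
)
top3 = df_grouped.loc[ranks.le(3)]
print(top3["start_time"].tolist())  # [4, 7, 5, 9, 3, 6]
```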
Upvotes: 0