Reputation: 21
I read an excel file to get a data frame that looks like this;
Date Lane Lane Name Direction DirectionName Speed (mph) Headway (s) Gap (s) Flags Flag Text
0 2018-02-02 00:00:03.000 6 SB_NS 2 South 38.525 NaN NaN 5 Friday
1 2018-02-02 00:00:22.010 5 SB_MID 2 South 32.310 NaN NaN 5 Friday
2 2018-02-02 00:00:22.020 4 SB_OS 2 South 44.739 NaN NaN 5 Friday
3 2018-02-02 00:00:36.040 6 SB_NS 2 South 33.554 NaN NaN 5 Friday
4 2018-02-02 00:00:49.070 6 SB_NS 2 South 39.768 12.300 11.847 5 Friday
... ... ... ... ... ... ... ... ... ... ...
503763 2018-02-27 23:59:00.090 2 NB_MID 1 North 32.932 4.415 3.833 2 Tuesday
503764 2018-02-27 23:59:29.090 6 SB_NS 2 South 29.825 65.500 64.700 2 Tuesday
503765 2018-02-27 23:59:32.050 4 SB_OS 2 South 29.205 236.000 235.848 2 Tuesday
503766 2018-02-27 23:59:33.070 6 SB_NS 2 South 37.283 3.330 3.462 2 Tuesday
503767 2018-02-27 23:59:58.050 1 NB_NS 1 North 36.661 76.000 75.669 2 Tuesday
503768 rows × 10 columns
I use this code:
## ONLY SHOW 02-02-2018 and SOUTH and COUNT
df = df.loc[(df.DirectionName =="South")]
df['Date'] = pd.to_datetime(df['Date'])
start_date = '2018-02-02'
end_date = '2018-02-03'
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
df.groupby([pd.Grouper(key='Date',freq='H'),df.DirectionName]).size().reset_index(name='count')
to get this output:
Date DirectionName count
0 2018-02-02 00:00:00 South 250
1 2018-02-02 01:00:00 South 137
2 2018-02-02 02:00:00 South 92
3 2018-02-02 03:00:00 South 92
4 2018-02-02 04:00:00 South 110
5 2018-02-02 05:00:00 South 272
6 2018-02-02 06:00:00 South 832
7 2018-02-02 07:00:00 South 1972
8 2018-02-02 08:00:00 South 2106
9 2018-02-02 09:00:00 South 1695
10 2018-02-02 10:00:00 South 1493
11 2018-02-02 11:00:00 South 1716
12 2018-02-02 12:00:00 South 1970
13 2018-02-02 13:00:00 South 2081
14 2018-02-02 14:00:00 South 2363
15 2018-02-02 15:00:00 South 2583
16 2018-02-02 16:00:00 South 2746
17 2018-02-02 17:00:00 South 2647
18 2018-02-02 18:00:00 South 2107
19 2018-02-02 19:00:00 South 1521
20 2018-02-02 20:00:00 South 1047
21 2018-02-02 21:00:00 South 851
22 2018-02-02 22:00:00 South 813
23 2018-02-02 23:00:00 South 557
The count is volume of data within that time frame, for example a volume of 250 between 0:00 - 1:00 on the 02/02/2012.
I want to be able to generate this output for every friday of the month (02/02/2018 + 09/02/2018 + 16/02/2018 + 23/02/2018). and then Add up the total for each hour across the four dates.
However, I am having trouble even being able to generate this output on the same notebook for each date. I can do it once in the same notebook, no problem. I can do it in separate notebooks, no problem. However I can not get it to work all at once within the same note book.
I have tried changing the date to include all dates within the range...
## ONLY SHOW 02-02-2018 and SOUTH and then COUNT
df = df.loc[(df.DirectionName =="South")]
df['Date'] = pd.to_datetime(df['Date'])
start_date = '2018-02-02' #CHANGED#
end_date = '2018-02-24' #CHANGED#
mask = (df['Date'] > start_date) & (df['Date'] <= end_date)
df = df.loc[mask]
df.groupby([pd.Grouper(key='Date',freq='H'),df.DirectionName]).size().reset_index(name='count')
But this will then obviously include dates which I don't want....
Date DirectionName count1
0 2018-02-02 00:00:00 South 250
1 2018-02-02 01:00:00 South 137
2 2018-02-02 02:00:00 South 92
3 2018-02-02 03:00:00 South 92
4 2018-02-02 04:00:00 South 110
... ... ... ...
163 2018-02-23 19:00:00 South 1490
164 2018-02-23 20:00:00 South 1056
165 2018-02-23 21:00:00 South 858
166 2018-02-23 22:00:00 South 783
167 2018-02-23 23:00:00 South 563
And even using this method, I am confused as to how I would ADD up all of the counts for the specific time+dates. I am doing this so that I would then be able to calculate the AVERAGE COUNT at 9am for Fridays during that month.
For example add together the count at 9am for 02/02/2018 + 09/02/2018 + 16/02/2018 + 23/02/2018.
I hope I have explained my problem clearly. I don't think my method is suitable for what I am trying to do. I would greatly appreciate some help / advice.
EDIT
When I update with using Quang Hoang's code, I now get this output below. Ideally, I still only want to output CERTAIN dates, which are : 02/02/2018, 09/02/2018, 16/02/2018, 23/02/2018
Date Date Date DirectionName
2018-02-02 00:00 4 0 South 250
2018-02-02 01:00 4 1 South 137
2018-02-02 02:00 4 2 South 92
2018-02-02 03:00 4 3 South 92
2018-02-02 04:00 4 4 South 110
...
2018-02-23 19:00 4 19 South 1490
2018-02-23 20:00 4 20 South 1056
2018-02-23 21:00 4 21 South 858
2018-02-23 22:00 4 22 South 783
2018-02-23 23:00 4 23 South 563
Length: 168, dtype: int64
Also, the date columns for this output are off, for example one of the columns should be hours not date
Upvotes: 0
Views: 39
Reputation: 150735
You are looking at groupby().size()
. However, there are several things you want to groupby here:
1. The hours, e.g. `9am`
2. The days of the week, e.g `Friday`
3. The months, e.g. `2018-02`
4. Possibly `DirectionName`
Let try to extract them:
df['Date'] = pd.to_datetime(df['Date'])
hours = df['Date'].dt.hour
days = df['Date'].dt.dayofweek
months = df['Date'].dt.to_period('M')
# exclude `'DirectionName'` if you want
df.groupby([months, days, hours, 'DirectionName']).size()
Upvotes: 1