TrevP
TrevP

Reputation: 167

pandas groupby and filling in missing frequencies

I have a dataset of events each of which occurred on a specific day. Using Pandas I have been able to aggregate these into a count of events per month using the groupby function, and then plot a graph with Matplotlib. However, in the original dataset some months do not have any events and so there is no count of events in such a month. Such months do not therefore appear on the graph, but I would like to include then somehow with their zero count

bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count()

which produces

month_year  month    
2016-01     January       9
2016-02     February      7
2016-04     April         1
2016-06     June          4
2016-07     July          1
2016-08     August        3
2016-09     September     2
2016-10     October       5
2016-11     November     17
2016-12     December      3

I have been trying to find a way of filling missing months in the dataframe generated by the groupby function with a 'count' value of 0 for, in this example, March and May..

Can anyone offer some advice on how this might be achieved. I have been trying to carry out FFill on the month column but with little success and can't work out how to add in a corresponding zero value for the missing months

Upvotes: 1

Views: 225

Answers (1)

David Erickson
David Erickson

Reputation: 16683

First of all, if bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count() is your code, then it is a series. So, let's change it to a dataframe with bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index(). Now, into the problem.

Change to date format and use pd.Grouper and change back to string format. Also add back the month column and change the formatting of the event_no column:

bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index()
bpm2['month_year'] = bpm2['month_year'].astype(str)
bpm2['month_year'] = pd.to_datetime(bpm2['month_year'])
bpm2 = bpm2.groupby([pd.Grouper(key='month_year', freq='1M')])['event_no'].first().fillna(0).astype(int).reset_index()
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')
bpm2

output:

    month_year  event_no month
0   2016-01     9        January
1   2016-02     7        February
2   2016-03     0        March
3   2016-04     1        April
4   2016-05     0        May
5   2016-06     4        June
6   2016-07     1        July
7   2016-08     3        August
8   2016-09     2        September
9   2016-10     5        October
10  2016-11     17       November
11  2016-12     3        December

Upvotes: 2

Related Questions