Reputation: 3025
I am using Python 2.7 and am creating some specific summary metrics by month from a dataframe. Each average (jan_avg_prod_count, feb_avg_prod_count, etc...) is then added to an output file.
My main df:
month_num prod_desc prod_count
01 A 52
01 B 43
01 C 38
02 A 21
02 B 24
02 C 18
What I am after is summarizing the average prod_count from the dataframe by month. Right now, I am manually aggregating each month with the following and repeating for each month:
jan_df = df[df.month_num == '01']
jan_df = jan_df.groupby(['prod_desc']).agg({'prod_count': np.mean})
jan_df = jan_df.rename(columns = {'prod_count':'jan_avg_prod_count'})
What I want to do is put this in a loop so I only have one block of code. My approach is to loop through the range 1-12 and subset my main df by each month as above. However, I am stuck on how to create and assign the dataframes to a dynamic variable name.
I am trying to create a dictionary of the correct dataframes and then assign them.
month_dfs = {}
for x in range(1,13):
    month_dfs[x] = "%s_df" % calendar.month_abbr[x].lower()
    loop_month_num = '{num:02d}'.format(num = x)
    month_dfs[x] = df[df.month_num == loop_month_num]
The above generates a dictionary of correct values but the assignment to the subset dataframe does not work correctly.
If I print month_dfs I get what I expect:
jan_df
feb_df
mar_df
apr_df
may_df
jun_df
jul_df
aug_df
sep_df
oct_df
nov_df
dec_df
However, jan_df does not get created. Rather the month_dfs dictionary gets populated with the subset dataframes.
>>> month_dfs[x]
month_num prod_desc prod_count
3 02 A 21
4 02 B 24
5 02 C 18
I understand what I am doing is not correct. I'm just not sure of how to get the monthly dataframes created. Any advice or new direction is appreciated.
Upvotes: 0
Views: 3340
Reputation: 6383
I think doing a pivot_table on your df to obtain another dataframe with the column names jan_avg_prod_count, feb_avg_prod_count, etc. is faster than a loop and easier to use than a dictionary.
In [35]: df['month'] = df.month_num.map(lambda n: calendar.month_abbr[int(n)].lower())+'_avg_prod_count'
In [36]: df
Out[36]:
month_num prod_desc prod_count month
0 01 A 52 jan_avg_prod_count
1 01 B 43 jan_avg_prod_count
2 01 C 38 jan_avg_prod_count
3 02 A 21 feb_avg_prod_count
4 02 B 24 feb_avg_prod_count
5 02 C 18 feb_avg_prod_count
In [37]: pt = df.pivot_table(columns='month',index='prod_desc',values='prod_count', aggfunc=sum)
In [38]: pt
Out[38]:
month feb_avg_prod_count jan_avg_prod_count
prod_desc
A 21 52
B 24 43
C 18 38
If you want to get 'jan_avg_prod_count', you can index pt like a dictionary:
In [39]: pt['jan_avg_prod_count']
Out[39]:
prod_desc
A 52
B 43
C 38
Name: jan_avg_prod_count, dtype: int64
which gives you a pandas Series.
Avoid explicit loops where you can; they are usually slower than pandas' built-in groupby and pivot operations.
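One caveat about the session above: aggfunc=sum matches the average only because each month/product pair has exactly one row in the sample data. A minimal sketch of the same pivot with aggfunc='mean', assuming the column names from the question, might look like this. The sample frame is rebuilt from the question's table, and the chronological column reordering at the end is my own addition, not part of the original answer.

```python
import calendar
import pandas as pd

# Sample data copied from the question; month_num is a zero-padded string.
df = pd.DataFrame({
    'month_num': ['01', '01', '01', '02', '02', '02'],
    'prod_desc': ['A', 'B', 'C', 'A', 'B', 'C'],
    'prod_count': [52, 43, 38, 21, 24, 18],
})

# Build the target column label for each row, e.g. '01' -> 'jan_avg_prod_count'.
df['month'] = df['month_num'].map(
    lambda n: calendar.month_abbr[int(n)].lower() + '_avg_prod_count')

# aggfunc='mean' averages every row that shares a (prod_desc, month) pair.
pt = df.pivot_table(index='prod_desc', columns='month',
                    values='prod_count', aggfunc='mean')

# pivot_table sorts columns alphabetically; put them back in calendar order
# (added for illustration).
ordered = [calendar.month_abbr[m].lower() + '_avg_prod_count'
           for m in range(1, 13)]
pt = pt[[c for c in ordered if c in pt.columns]]
```

With real data that has several rows per month and product, the mean and the sum will differ, so the choice of aggfunc matters.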
Upvotes: 1
Reputation: 2640
Not sure if it fits your use case, but you might want to make a dictionary of dataframes where each key is the month, so dataframes['jan'] gives you the dataframe for January, and so on.
If that works for you, you could do something like:
months = ['jan','feb','mar', ...]
# month_num holds zero-padded strings such as '01', so format the index to match
dataframes = {month: pandas.DataFrame(df[df['month_num'] == '%02d' % index])
              for month, index in zip(months, range(1, 13))}
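As a rough sketch of the same dictionary idea, you could also let groupby do the splitting, so the month list does not have to be typed out. This assumes month_num holds zero-padded strings as in the question; the sample frame and the jan_avg name are only for illustration.

```python
import calendar
import pandas as pd

# Sample data copied from the question; month_num is a zero-padded string.
df = pd.DataFrame({
    'month_num': ['01', '01', '01', '02', '02', '02'],
    'prod_desc': ['A', 'B', 'C', 'A', 'B', 'C'],
    'prod_count': [52, 43, 38, 21, 24, 18],
})

# Iterating a groupby yields (key, sub-frame) pairs, one per month in the data.
dataframes = {
    calendar.month_abbr[int(month_num)].lower(): group
    for month_num, group in df.groupby('month_num')
}

# Per-product average for January, taken from the dictionary entry.
jan_avg = dataframes['jan'].groupby('prod_desc')['prod_count'].mean()
```

Note that months with no rows in df simply get no key here, so check with 'mar' in dataframes before indexing.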
Upvotes: 1