How to create variables inside a for loop to assign dataframes?

Question

I am using Python 2.7 and am creating some specific summary metrics by month from a dataframe. Each average (jan_avg_prod_count, feb_avg_prod_count, etc...) is then added to an output file.

My main df:

month_num     prod_desc    prod_count
01             A            52
01             B            43
01             C            38
02             A            21
02             B            24
02             C            18

What I am after is summarizing the average prod_count from the dataframe by month. Right now, I am manually aggregating each month with the following and repeating for each month:

jan_df = df[df.month_num == '01']
jan_df = jan_df.groupby(['prod_desc']).agg({'prod_count': np.mean})
jan_df = jan_df.rename(columns = {'prod_count':'jan_avg_prod_count'})

What I want to do is put this in a loop so I only have one block of code. My approach is to loop through the range 1-12 and subset my main df by each month as above. However, I am stuck on how to create and assign the dataframes to a dynamic variable name.

I am trying to create a dictionary of the correct dataframes and then assign them.

month_dfs = {}
for x in range(1,13):
    month_dfs[x] = "%s_df" % calendar.month_abbr[x].lower()
    loop_month_num = '{num:02d}'.format(num = x)
    month_dfs[x] = df[df.month_num == loop_month_num]

The above generates a dictionary of correct values but the assignment to the subset dataframe does not work correctly.

If I print month_dfs I get what I expect:

jan_df
feb_df
mar_df
apr_df
may_df
jun_df
jul_df
aug_df
sep_df
oct_df
nov_df
dec_df

However, jan_df does not get created. Rather the month_dfs dictionary gets populated with the subset dataframes.

>>> month_dfs[x]
  month_num prod_desc prod_count
3        02         A         21
4        02         B         24
5        02         C         18

I understand what I am doing is not correct. I'm just not sure of how to get the monthly dataframes created. Any advice or new direction is appreciated.

Happy001 · Accepted Answer

I think building a pivot_table from your df, to obtain another dataframe with columns named jan_avg_prod_count, feb_avg_prod_count, etc., is faster than a loop and easier to use than a dictionary.

In [35]: df['month'] = df.month_num.map(lambda n: calendar.month_abbr[int(n)].lower())+'_avg_prod_count'

In [36]: df
Out[36]: 
  month_num prod_desc  prod_count               month
0        01         A          52  jan_avg_prod_count
1        01         B          43  jan_avg_prod_count
2        01         C          38  jan_avg_prod_count
3        02         A          21  feb_avg_prod_count
4        02         B          24  feb_avg_prod_count
5        02         C          18  feb_avg_prod_count

In [37]: pt = df.pivot_table(columns='month',index='prod_desc',values='prod_count', aggfunc=sum)

In [38]: pt
Out[38]: 
month      feb_avg_prod_count  jan_avg_prod_count
prod_desc                                        
A                          21                  52
B                          24                  43
C                          18                  38

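Note that aggfunc=sum matches the average here only because each month/product pair has a single row; pass aggfunc='mean' if a pair can have several rows.

If you want to get 'jan_avg_prod_count', you can use pt like a dictionary: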

In [39]: pt['jan_avg_prod_count']
Out[39]: 
prod_desc
A            52
B            43
C            38
Name: jan_avg_prod_count, dtype: int64

which gives you a pandas Series.
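
The steps above can be put together as a self-contained sketch. It uses aggfunc='mean', since the question asks for averages, and it assumes the column names from the question. The extra row for month 01 / product A is made up for illustration, so that sum and mean give different results; it is not part of the original data.

```python
import calendar

import pandas as pd

# Sample data from the question, plus one extra (made-up) row for
# month 01 / product A so that sum and mean give different results.
df = pd.DataFrame({
    "month_num": ["01", "01", "01", "01", "02", "02", "02"],
    "prod_desc": ["A", "A", "B", "C", "A", "B", "C"],
    "prod_count": [52, 48, 43, 38, 21, 24, 18],
})

# Build the target column name for each row, e.g. "jan_avg_prod_count".
df["month"] = df["month_num"].map(
    lambda n: calendar.month_abbr[int(n)].lower() + "_avg_prod_count"
)

# One row per product, one column per month; duplicate rows are averaged.
pt = df.pivot_table(
    index="prod_desc", columns="month", values="prod_count", aggfunc="mean"
)

# Column access works like a dictionary lookup and returns a Series
# indexed by prod_desc.
jan = pt["jan_avg_prod_count"]
```

Here pt.loc['A', 'jan_avg_prod_count'] is the mean of 52 and 48, whereas aggfunc=sum would have returned their total.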

Avoid explicit loops where you can, because they are slow compared with built-in pandas operations.
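
If separate per-month dataframes are still needed, as the question asks, a dictionary keyed by name is the usual substitute for dynamically named variables. The following is only a sketch under that assumption: the key format "jan_df" mirrors the names in the question, and the loop runs once per month present in the data (at most 12 times), so its cost should be small.

```python
import calendar

import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "month_num": ["01", "01", "01", "02", "02", "02"],
    "prod_desc": ["A", "B", "C", "A", "B", "C"],
    "prod_count": [52, 43, 38, 21, 24, 18],
})

month_dfs = {}
# groupby splits df by month, so no manual filtering per month is needed.
for month_num, group in df.groupby("month_num"):
    abbr = calendar.month_abbr[int(month_num)].lower()  # "jan", "feb", ...
    month_dfs[abbr + "_df"] = (
        group.groupby("prod_desc")[["prod_count"]]
        .mean()
        .rename(columns={"prod_count": abbr + "_avg_prod_count"})
    )

# Look a frame up by name instead of creating a variable called jan_df.
jan_df = month_dfs["jan_df"]
```

Each value in month_dfs is a dataframe indexed by prod_desc with a single renamed column, which is the same result the manual three-line version in the question is aiming for.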
