Reputation: 3
I am currently making the switch from R to Python and wonder whether I can speed up the following dataframe operations. I have a sales dataset consisting of 500k rows and 17 columns on which I need to do some calculations before I put them into a dashboard. My data looks like this:
location time product sales
store1 2017 brandA 10
store1 2017 brandB 17
store1 2017 brandC 15
store1 2017 brandD 19
store1 2017 catTot 86
store2 2017 brandA 8
store2 2017 brandB 23
store2 2017 brandC 5
store2 2017 brandD 12
store2 2017 catTot 76
. . . .
. . . .
. . . .
. . . .
catTot is a pre-aggregated value I get from the raw data set; it shows the total sales for a given store in a given time period. As you can see, the other products are only a fraction of the total and never add up to it, but they are included in it. Since I want to reflect the total sales in a given location without showing all products (due to performance issues in the dashboard), I need to replace each catTot value with the current value minus the sum of the other products in that group; for store1 in 2017, for example, that would be 86 - (10 + 17 + 15 + 19) = 25.
Currently, I iterate through nested for loops to make the changes. The code looks like this:
df['location'] = df.location.astype('category')
df['time'] = df.time.astype('category')
for var_time in df.time.cat.categories:
    for var_geo in df.location.cat.categories:
        # mask down to one location/time combination
        df_tmp = df[(df['location'] == var_geo) & (df['time'] == var_time)]
        # catTot is the last row: total minus the sum of all other products
        fct_eur = df_tmp.iloc[len(df_tmp)-1, 3] - df_tmp.iloc[0:len(df_tmp)-1, 3].sum()
        df.loc[(df['location'] == var_geo) & (df['time'] == var_time) & (df['product'] == 'catTot'), ['sales']] = fct_eur
As you can see, catTot is always the last row in the masked dataframe. This operation currently takes around 9 min every time, since I have 23 store locations, around 880 products, 30 time periods and 5 different measures, which results in about 500k rows. Is there a more elegant or at least faster way to do this kind of operation?
Upvotes: 0
Views: 127
Reputation: 3
A friend actually proposed this way of tackling my problem, and this code is also his. It builds a nested dictionary and adds the measure to the keys for each row, but everything except the catTot rows is multiplied by -1, so in the end only the remainder is kept.
def safe_add(mapping, store, year, brand, count):
    # make sure the nested keys exist
    if store not in mapping:
        mapping[store] = {}
    if year not in mapping[store]:
        mapping[store][year] = 0
    # every product except catTot is negated, so the running sum
    # ends up as catTot minus the sum of all other products
    if brand != 'catTot':
        count = count * -1
    mapping[store][year] = mapping[store][year] + count

mapping = {}
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
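For illustration, here is what this produces for the store1 sample rows from my question (the literal data list is made up for this sketch; in my case data comes from the original dataframe):

data = [
    ('store1', 2017, 'brandA', 10),
    ('store1', 2017, 'brandB', 17),
    ('store1', 2017, 'brandC', 15),
    ('store1', 2017, 'brandD', 19),
    ('store1', 2017, 'catTot', 86),
]
mapping = {}
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
print(mapping)  # {'store1': {2017: 25}}, i.e. 86 - (10 + 17 + 15 + 19)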
After building the nested dictionary, I loop through it once to count the number of rows I will need to write out. I do this in order to prepopulate an empty dataframe and then fill it up.
counter = 0
for geo in mapping.keys():
    for time in mapping[geo].keys():
        counter += 1
df_annex = pd.DataFrame(data=None, index=np.arange(0, counter), columns=df.columns)
counter = 0  # reset and reuse as the row index while filling df_annex
for geo in mapping.keys():
    for time in mapping[geo].keys():
        df_annex.iloc[counter, 0] = geo
.
.
After writing out the dictionary, I simply filter the old totals out of df and concat the result with the annex. This brings the runtime down to 7.88 s vs. 9 min.
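That last step looks roughly like this (a minimal sketch; df_annex holds the recomputed catTot rows from above):

import pandas as pd

# drop the old catTot rows and append the recomputed totals
df_rest = df[df['product'] != 'catTot']
df = pd.concat([df_rest, df_annex], ignore_index=True)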
Upvotes: 0
Reputation: 142146
You can create a grouping key where everything that is not "catTot" is set to "sales", then use pivot_table to aggregate the sales column, e.g.:
agg = df.pivot_table(
    index=['location', 'time'],
    columns=np.where(df['product'] == 'catTot', 'catTot', 'sales'),
    values='sales',
    aggfunc='sum'
)
This'll give you:
catTot sales
location time
store1 2017 86 61
store2 2017 76 48
Then you can do new_total = agg['catTot'] - agg['sales'], which gives:
location time
store1 2017 25
store2 2017 28
dtype: int64
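If you also want those corrected totals written back into the original frame, one way (a sketch on top of this answer, assuming the same column names as in the question) is to align new_total with the catTot rows:

import pandas as pd

# map the corrected totals back onto the catTot rows of df
mask = df['product'] == 'catTot'
keys = pd.MultiIndex.from_frame(df.loc[mask, ['location', 'time']])
df.loc[mask, 'sales'] = new_total.reindex(keys).to_numpy()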
Upvotes: 1