friedrich

Reputation: 3

Speed up dataframe operations in pandas

I am currently making the switch from R to Python and wonder whether I can speed up the following dataframe operations. I have a sales dataset of 500k rows and 17 columns on which I need to do some calculations before putting them into a dashboard. My data looks like this:

location  time  product  sales
store1    2017  brandA   10
store1    2017  brandB   17 
store1    2017  brandC   15
store1    2017  brandD   19
store1    2017  catTot   86
store2    2017  brandA   8
store2    2017  brandB   23 
store2    2017  brandC   5
store2    2017  brandD   12
store2    2017  catTot   76
.         .     .         .
.         .     .         .
.         .     .         .
.         .     .         .

catTot is a pre-aggregate I get from the raw data set, showing the total sales for a given store in a given time period. As you can see, the other products are only a fraction of the total and never add up to it, but they are included in it. Since I want to reflect the total sales in a given location without showing all products (due to performance issues in the dashboard), I need to replace the catTot values with an aggregate that is the current value minus the sum of the other products.
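
For the sample above, the desired catTot values would be 86 - (10 + 17 + 15 + 19) = 25 for store1/2017 and 76 - (8 + 23 + 5 + 12) = 28 for store2/2017.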

Currently, I iterate through nested for loops to make the changes. The code looks like this:

import pandas as pd

df['location'] = df.location.astype('category')
df['time'] = df.time.astype('category')

for var_time in df.time.cat.categories:
    for var_geo in df.location.cat.categories:
        # mask out one store/period combination
        df_tmp = df[(df['location'] == var_geo) & (df['time'] == var_time)]
        # catTot (last row) minus the sum of all product rows above it
        fct_eur = df_tmp.iloc[len(df_tmp) - 1, 3] - df_tmp.iloc[0:len(df_tmp) - 1, 3].sum()
        df.loc[(df['location'] == var_geo) & (df['time'] == var_time) & (df['product'] == 'catTot'), ['sales']] = fct_eur

As you can see, catTot is always the last row in the masked dataframe. This operation currently takes around 9 minutes every time, since I have 23 store locations, around 880 products, 30 time periods and 5 different measures, which results in about 500k rows. Is there a more elegant or at least faster way to do this kind of operation?

Upvotes: 0

Views: 127

Answers (2)

friedrich

Reputation: 3

A friend actually proposed this way of tackling my problem. This code is also his. It builds a nested dictionary and adds the measure to the keys for each row, but everything except catTot is multiplied by -1, so in the end only the remainder is kept.

def safe_add(mapping, store, year, brand, count):
    if store not in mapping:
        mapping[store] = {}
    if year not in mapping[store]:
        mapping[store][year] = 0
    # products count negative, catTot counts positive, so the running
    # sum ends up as catTot minus the sum of all products
    if brand != 'catTot':
        count = count * -1
    mapping[store][year] = mapping[store][year] + count

mapping = {}
# data: iterable of (location, time, product, sales) rows
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
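
Assuming data holds the sample rows from the question as (location, time, product, sales) tuples, the finished mapping contains exactly the corrected totals:

mapping
# {'store1': {2017: 25}, 'store2': {2017: 28}}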

After building the nested dictionary, I looped through it once to count how many rows I will need to write out. I do this in order to be able to prepopulate an empty dataframe and fill it up.

import numpy as np

counter = 0
for geo in mapping.keys():
    for time in mapping[geo].keys():
        counter += 1

df_annex = pd.DataFrame(data=None, index=np.arange(0, counter), columns=df.columns)

i = 0  # running row index, incremented in the elided column-filling lines
for geo in mapping.keys():
    for time in mapping[geo].keys():
        df_annex.iloc[i, 0] = geo
        .
        .

After writing out the dictionary, I simply subset the old totals out of df and concatenated it with the annex. This results in a runtime of 7.88 s vs. 9 min.
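
Roughly, that final step looks like this (the variable names here are assumed, not from my original code):

# drop the old pre-aggregated totals and append the corrected ones
df_rest = df[df['product'] != 'catTot']
df_final = pd.concat([df_rest, df_annex], ignore_index=True)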

Upvotes: 0

Jon Clements

Reputation: 142146

You can create a grouping key where everything that's not "catTot" is set to "sales", then use pivot_table to aggregate the sales column, eg:

import numpy as np

agg = df.pivot_table(
    index=['location', 'time'],
    columns=np.where(df['product'] == 'catTot', 'catTot', 'sales'),
    values='sales',
    aggfunc='sum'
)

This'll give you:

               catTot  sales
location time
store1   2017      86     61
store2   2017      76     48

Then you can do new_total = agg['catTot'] - agg['sales']:

location  time
store1    2017    25
store2    2017    28
dtype: int64
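
If you then want to push those corrected totals back into the original frame, one possible way (a sketch relying on index alignment, not part of the aggregation itself) is:

# align on (location, time): the masked catTot rows pick up the
# matching values from new_total via the shared index
df = df.set_index(['location', 'time'])
df.loc[df['product'] == 'catTot', 'sales'] = new_total
df = df.reset_index()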

Upvotes: 1
