Reputation: 3
I am currently making the switch from R to Python and wonder whether I can speed up the following dataframe operations. I have a sales dataset consisting of 500k rows and 17 columns on which I need to do some calculations before I put them into a dashboard. My data looks like this:
location time product sales
store1 2017 brandA 10
store1 2017 brandB 17
store1 2017 brandC 15
store1 2017 brandD 19
store1 2017 catTot 86
store2 2017 brandA 8
store2 2017 brandB 23
store2 2017 brandC 5
store2 2017 brandD 12
store2 2017 catTot 76
. . . .
. . . .
. . . .
. . . .
catTot is a pre-aggregated value I get from the raw data set; it shows the total sales for a given store in a given time period. As you can see, the other products are only a fraction of the total and never add up to it, but they are included in it. Since I want to reflect the total sales in a given location without showing all products (due to performance issues in the dashboard), I need to replace each catTot value with the current value minus the sum of the other products in that group; for store1 in 2017, for example, that would be 86 - (10 + 17 + 15 + 19) = 25.
Currently, I iterate through nested for loops to make the changes. The code looks like this:
df['location'] = df.location.astype('category')
df['time'] = df.time.astype('category')
for var_time in df.time.cat.categories:
    for var_geo in df.location.cat.categories:
        # mask down to one location/time combination
        df_tmp = df[(df['location'] == var_geo) & (df['time'] == var_time)]
        # catTot is the last row: total minus the sum of all other products
        fct_eur = df_tmp.iloc[len(df_tmp)-1, 3] - df_tmp.iloc[0:len(df_tmp)-1, 3].sum()
        df.loc[(df['location'] == var_geo) & (df['time'] == var_time) & (df['product'] == 'catTot'), ['sales']] = fct_eur
As you can see, catTot is always the last row in the masked dataframe. This operation currently takes around 9 min every time, since I have 23 store locations, around 880 products, 30 time periods and 5 different measures, which results in about 500k rows. Is there a more elegant or at least faster way to do this kind of operation?
Upvotes: 0
Views: 127
Reputation: 3
A friend actually proposed this way of tackling my problem, and this code is also his. It builds a nested dictionary and adds the measure to the keys for each row, but everything except the catTot rows is multiplied by -1, so in the end only the remainder is kept.
def safe_add(mapping, store, year, brand, count):
    # make sure the nested keys exist
    if store not in mapping:
        mapping[store] = {}
    if year not in mapping[store]:
        mapping[store][year] = 0
    # every product except catTot is negated, so the running sum
    # ends up as catTot minus the sum of all other products
    if brand != 'catTot':
        count = count * -1
    mapping[store][year] = mapping[store][year] + count

mapping = {}
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
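For illustration, here is what this produces for the store1 sample rows from my question (the literal data list is made up for this sketch; in my case data comes from the original dataframe):

data = [
    ('store1', 2017, 'brandA', 10),
    ('store1', 2017, 'brandB', 17),
    ('store1', 2017, 'brandC', 15),
    ('store1', 2017, 'brandD', 19),
    ('store1', 2017, 'catTot', 86),
]
mapping = {}
for row in data:
    safe_add(mapping, row[0], int(row[1]), row[2], int(row[3]))
print(mapping)  # {'store1': {2017: 25}}, i.e. 86 - (10 + 17 + 15 + 19)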
After building the nested dictionary, I loop through it once to count the number of rows I will need to write out. I do this in order to prepopulate an empty dataframe and then fill it up.
counter = 0
for geo in mapping.keys():
    for time in mapping[geo].keys():
        counter += 1
df_annex = pd.DataFrame(data=None, index=np.arange(0, counter), columns=df.columns)
counter = 0  # reset and reuse as the row index while filling df_annex
for geo in mapping.keys():
    for time in mapping[geo].keys():
        df_annex.iloc[counter, 0] = geo
.
.
After writing out the dictionary, I simply filter the old totals out of df and concat the result with the annex. This brings the runtime down to 7.88 s vs. 9 min.
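That last step looks roughly like this (a minimal sketch; df_annex holds the recomputed catTot rows from above):

import pandas as pd

# drop the old catTot rows and append the recomputed totals
df_rest = df[df['product'] != 'catTot']
df = pd.concat([df_rest, df_annex], ignore_index=True)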
Upvotes: 0
Reputation: 142146
You can create a grouping key where everything that is not "catTot" is set to "sales", then use pivot_table to aggregate the sales column, e.g.:
agg = df.pivot_table(
    index=['location', 'time'],
    columns=np.where(df['product'] == 'catTot', 'catTot', 'sales'),
    values='sales',
    aggfunc='sum'
)
This'll give you:
catTot sales
location time
store1 2017 86 61
store2 2017 76 48
Then you can do new_total = agg['catTot'] - agg['sales'], which gives:
location time
store1 2017 25
store2 2017 28
dtype: int64
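If you also want those corrected totals written back into the original frame, one way (a sketch on top of this answer, assuming the same column names as in the question) is to align new_total with the catTot rows:

import pandas as pd

# map the corrected totals back onto the catTot rows of df
mask = df['product'] == 'catTot'
keys = pd.MultiIndex.from_frame(df.loc[mask, ['location', 'time']])
df.loc[mask, 'sales'] = new_total.reindex(keys).to_numpy()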
Upvotes: 1