Reputation: 891
I have the following df:
dff=pd.DataFrame(index=[1]*10+[2]*10,data={'mth':list(range(1,11))*2,'pmt':[10,5,3,10,20,4,1,6,5,6]*2,'min_pmt':[5]*10+[4]*10,'Stat':[np.nan]*20,'up':[np.nan]*20,'up_cum':[np.nan]*20})
I have in it 2 different values in the index: customer 1 and customer 2, and for each one I have couple of rows, 1 row per month. I need to make some complicated calculation, which is dependent on his performances in the previous months, but independent in the other customer, as shown in the following code:
for i in dff.index.unique():
dff_temp=dff.loc[i]
f=0
up_cum=0
for r,(j,row) in enumerate(dff_temp.iterrows()):
if (row.pmt<row.min_pmt)&(f==0):
row['Stat']=1
row['up']=row.pmt-row.min_pmt
up_cum+=row['up']
row['up_cum']=up_cum
f=1
elif (f==1):
row['up']=row.pmt-row.min_pmt
up_cum+=np.minimum(row['up'],0)
row['up_cum']=up_cum
row['Stat']=np.floor(row.up_cum/row.min_pmt)
if row.Stat>0:
f=1
else:
f=0
else:
row['Stat']=0
row['up']=0
row['up_cum']=0
f=0
dff_temp.iloc[r]=row
dff.loc[[i],['Stat','up','up_cum']]=dff_temp[['Stat','up','up_cum']].values
What's the optimal way for iterating over each chunk of the df and making the following calculation? This one is pretty slow and I have df with 100K different customers.
Thanks for the help.
Upvotes: 1
Views: 91
Reputation: 891
I was able to improve the efficiency much more using apply on the groupby object as shown below:
def calc(dff_temp):
f=0
up_cum=0
for r,(j,row) in enumerate(dff_temp.iterrows()):
if (row.pmt<row.min_pmt)&(f==0):
row['Stat']=1
row['up']=row.pmt-row.min_pmt
up_cum+=row['up']
row['up_cum']=up_cum
f=1
elif (f==1):
row['up']=row.pmt-row.min_pmt
up_cum+=np.minimum(row['up'],0)
row['up_cum']=up_cum
row['Stat']=np.floor(row.up_cum/row.min_pmt)
if row.Stat>0:
f=1
else:
f=0
else:
row['Stat']=0
row['up']=0
row['up_cum']=0
f=0
dff_temp.iloc[r]=row
return(dff_temp)
dff.groupby(dff.index).apply(lambda x: calc(x))
Upvotes: 1
Reputation: 1873
I would get rid of slicing to save some time. Note that the only rolling parameters are up_cum
and f
, and I would store them in a dictionary.
rolling = dict() # key=customer label, value=(current value of up_cum, value of f)
for r, (j,row) in enumerate(dff.iterrows()):
up_cum, f = rolling.get(j, (0, 0))
if (row.pmt<row.min_pmt) & (f==0):
row['Stat']=1
row['up']=row.pmt-row.min_pmt
up_cum += row['up']
row['up_cum']=up_cum
f=1
elif (f==1):
row['up']=row.pmt-row.min_pmt
up_cum += np.minimum(row['up'],0)
row['up_cum']=up_cum
row['Stat']=np.floor(row.up_cum/row.min_pmt)
if row.Stat>0:
f=1
else:
f=0
else:
row['Stat']=0
row['up']=0
row['up_cum']=0
f=0
dff.iloc[r] = row
rolling[j] = (up_cum, f)
Using timeit
in your sample example, I got about time decreased by a factor of two. I expect it to be even more on a larger dataset.
Upvotes: 1