David

Reputation: 891

Iterating over a df in chunks, based on index

I have the following df:

import numpy as np
import pandas as pd

dff = pd.DataFrame(index=[1] * 10 + [2] * 10,
                   data={'mth': list(range(1, 11)) * 2,
                         'pmt': [10, 5, 3, 10, 20, 4, 1, 6, 5, 6] * 2,
                         'min_pmt': [5] * 10 + [4] * 10,
                         'Stat': [np.nan] * 20,
                         'up': [np.nan] * 20,
                         'up_cum': [np.nan] * 20})

The index contains 2 different values, customer 1 and customer 2, and for each one there are several rows, 1 row per month. I need to make a fairly complicated calculation that depends on the customer's payments in the previous months but is independent of the other customer, as shown in the following code:

for i in dff.index.unique():
    dff_temp = dff.loc[i]
    f = 0        # flag: 1 while the customer is in "underpayment" state
    up_cum = 0   # running cumulative shortfall
    for r, (j, row) in enumerate(dff_temp.iterrows()):
        if (row.pmt < row.min_pmt) and (f == 0):
            # first month of underpayment
            row['Stat'] = 1
            row['up'] = row.pmt - row.min_pmt
            up_cum += row['up']
            row['up_cum'] = up_cum
            f = 1
        elif f == 1:
            # still in underpayment: accumulate only further shortfalls
            row['up'] = row.pmt - row.min_pmt
            up_cum += min(row['up'], 0)
            row['up_cum'] = up_cum
            row['Stat'] = np.floor(row.up_cum / row.min_pmt)
            f = 1 if row.Stat > 0 else 0
        else:
            row['Stat'] = 0
            row['up'] = 0
            row['up_cum'] = 0
            f = 0
        dff_temp.iloc[r] = row

    dff.loc[[i], ['Stat', 'up', 'up_cum']] = dff_temp[['Stat', 'up', 'up_cum']].values

What's the optimal way to iterate over each chunk of the df and perform the above calculation? This approach is pretty slow, and I have a df with 100K different customers.

Thanks for the help.

Upvotes: 1

Views: 91

Answers (2)

David

Reputation: 891

I was able to improve the efficiency considerably by using apply on a groupby object, as shown below:

def calc(dff_temp):
    f = 0
    up_cum = 0
    for r, (j, row) in enumerate(dff_temp.iterrows()):
        if (row.pmt < row.min_pmt) and (f == 0):
            row['Stat'] = 1
            row['up'] = row.pmt - row.min_pmt
            up_cum += row['up']
            row['up_cum'] = up_cum
            f = 1
        elif f == 1:
            row['up'] = row.pmt - row.min_pmt
            up_cum += min(row['up'], 0)
            row['up_cum'] = up_cum
            row['Stat'] = np.floor(row.up_cum / row.min_pmt)
            f = 1 if row.Stat > 0 else 0
        else:
            row['Stat'] = 0
            row['up'] = 0
            row['up_cum'] = 0
            f = 0
        dff_temp.iloc[r] = row
    return dff_temp

# apply hands each group to calc as a copy, so assign the result back
dff = dff.groupby(dff.index, group_keys=False).apply(calc)

Upvotes: 1

Kate Melnykova

Reputation: 1873

I would get rid of the slicing to save some time. Note that the only rolling state is up_cum and f, which I would store per customer in a dictionary:

rolling = dict()  # key=customer label, value=(current up_cum, current f)
for r, (j, row) in enumerate(dff.iterrows()):
    up_cum, f = rolling.get(j, (0, 0))
    if (row.pmt < row.min_pmt) and (f == 0):
        row['Stat'] = 1
        row['up'] = row.pmt - row.min_pmt
        up_cum += row['up']
        row['up_cum'] = up_cum
        f = 1
    elif f == 1:
        row['up'] = row.pmt - row.min_pmt
        up_cum += min(row['up'], 0)
        row['up_cum'] = up_cum
        row['Stat'] = np.floor(row.up_cum / row.min_pmt)
        f = 1 if row.Stat > 0 else 0
    else:
        row['Stat'] = 0
        row['up'] = 0
        row['up_cum'] = 0
        f = 0
    dff.iloc[r] = row
    rolling[j] = (up_cum, f)

Using timeit on your sample, the running time decreased by about a factor of two. I expect the gain to be even larger on a bigger dataset.
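Building on the dictionary idea, the same single pass can usually be made faster still by using itertuples (plain namedtuples, much cheaper than the per-row Series that iterrows builds) and by writing the results into NumPy arrays, assigning the three columns once at the end instead of one .iloc write per row. A sketch of that variant, using the sample frame from the question; the logic mirrors the loop above, but I have not benchmarked it at 100K-customer scale:

```python
import numpy as np
import pandas as pd

# sample frame from the question
dff = pd.DataFrame(index=[1] * 10 + [2] * 10,
                   data={'mth': list(range(1, 11)) * 2,
                         'pmt': [10, 5, 3, 10, 20, 4, 1, 6, 5, 6] * 2,
                         'min_pmt': [5] * 10 + [4] * 10,
                         'Stat': [np.nan] * 20,
                         'up': [np.nan] * 20,
                         'up_cum': [np.nan] * 20})

n = len(dff)
stat = np.zeros(n)        # results accumulate here instead of in the frame
up = np.zeros(n)
up_cum_out = np.zeros(n)
rolling = {}              # customer label -> (running up_cum, flag f)

for r, t in enumerate(dff.itertuples()):
    j = t.Index
    up_cum, f = rolling.get(j, (0, 0))
    if t.pmt < t.min_pmt and f == 0:
        # first month of underpayment
        stat[r] = 1
        up[r] = t.pmt - t.min_pmt
        up_cum += up[r]
        up_cum_out[r] = up_cum
        f = 1
    elif f == 1:
        # still in underpayment: accumulate only further shortfalls
        up[r] = t.pmt - t.min_pmt
        up_cum += min(up[r], 0)
        up_cum_out[r] = up_cum
        stat[r] = np.floor(up_cum / t.min_pmt)
        f = 1 if stat[r] > 0 else 0
    else:
        # row outputs stay zero; the running up_cum carries over
        # unchanged, exactly as in the iterrows loops above
        f = 0
    rolling[j] = (up_cum, f)

# one vectorised write-back instead of one .iloc assignment per row
dff['Stat'], dff['up'], dff['up_cum'] = stat, up, up_cum_out
```

The per-row `.iloc` assignment is typically where most of the remaining time goes, so replacing it with three column assignments at the end tends to help as much as the itertuples switch itself.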

Upvotes: 1
