David

Reputation: 891

Iterating over a df in chunks, based on index

I have the following df:

import numpy as np
import pandas as pd

dff = pd.DataFrame(index=[1] * 10 + [2] * 10,
                   data={'mth': list(range(1, 11)) * 2,
                         'pmt': [10, 5, 3, 10, 20, 4, 1, 6, 5, 6] * 2,
                         'min_pmt': [5] * 10 + [4] * 10,
                         'Stat': [np.nan] * 20,
                         'up': [np.nan] * 20,
                         'up_cum': [np.nan] * 20})

The index contains 2 different values, customer 1 and customer 2, and for each one there are several rows, 1 row per month. I need to make a fairly complicated calculation that depends on the customer's payments in the previous months but is independent of the other customer, as shown in the following code:

for i in dff.index.unique():
    dff_temp = dff.loc[i]
    f = 0        # flag: 1 while the customer is in "underpayment" state
    up_cum = 0   # running cumulative shortfall
    for r, (j, row) in enumerate(dff_temp.iterrows()):
        if (row.pmt < row.min_pmt) and (f == 0):
            # first month of underpayment
            row['Stat'] = 1
            row['up'] = row.pmt - row.min_pmt
            up_cum += row['up']
            row['up_cum'] = up_cum
            f = 1
        elif f == 1:
            # still in underpayment: accumulate only further shortfalls
            row['up'] = row.pmt - row.min_pmt
            up_cum += min(row['up'], 0)
            row['up_cum'] = up_cum
            row['Stat'] = np.floor(row.up_cum / row.min_pmt)
            f = 1 if row.Stat > 0 else 0
        else:
            row['Stat'] = 0
            row['up'] = 0
            row['up_cum'] = 0
            f = 0
        dff_temp.iloc[r] = row

    dff.loc[[i], ['Stat', 'up', 'up_cum']] = dff_temp[['Stat', 'up', 'up_cum']].values

What's the optimal way to iterate over each chunk of the df and perform the above calculation? This approach is pretty slow, and I have a df with 100K different customers.

Thanks for the help.

Upvotes: 1

Views: 91

Answers (2)

David

Reputation: 891

I was able to improve the efficiency considerably by using apply on a groupby object, as shown below:

def calc(dff_temp):
    f = 0
    up_cum = 0
    for r, (j, row) in enumerate(dff_temp.iterrows()):
        if (row.pmt < row.min_pmt) and (f == 0):
            row['Stat'] = 1
            row['up'] = row.pmt - row.min_pmt
            up_cum += row['up']
            row['up_cum'] = up_cum
            f = 1
        elif f == 1:
            row['up'] = row.pmt - row.min_pmt
            up_cum += min(row['up'], 0)
            row['up_cum'] = up_cum
            row['Stat'] = np.floor(row.up_cum / row.min_pmt)
            f = 1 if row.Stat > 0 else 0
        else:
            row['Stat'] = 0
            row['up'] = 0
            row['up_cum'] = 0
            f = 0
        dff_temp.iloc[r] = row
    return dff_temp

# apply hands each group to calc as a copy, so assign the result back
dff = dff.groupby(dff.index, group_keys=False).apply(calc)

Upvotes: 1

Kate Melnykova

Reputation: 1873

I would get rid of the slicing to save some time. Note that the only rolling state is up_cum and f, which I would store per customer in a dictionary:

rolling = dict()  # key=customer label, value=(current up_cum, current f)
for r, (j, row) in enumerate(dff.iterrows()):
    up_cum, f = rolling.get(j, (0, 0))
    if (row.pmt < row.min_pmt) and (f == 0):
        row['Stat'] = 1
        row['up'] = row.pmt - row.min_pmt
        up_cum += row['up']
        row['up_cum'] = up_cum
        f = 1
    elif f == 1:
        row['up'] = row.pmt - row.min_pmt
        up_cum += min(row['up'], 0)
        row['up_cum'] = up_cum
        row['Stat'] = np.floor(row.up_cum / row.min_pmt)
        f = 1 if row.Stat > 0 else 0
    else:
        row['Stat'] = 0
        row['up'] = 0
        row['up_cum'] = 0
        f = 0
    dff.iloc[r] = row
    rolling[j] = (up_cum, f)

Using timeit on your sample, the running time decreased by about a factor of two. I expect the gain to be even larger on a bigger dataset.
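Building on the dictionary idea, the same single pass can usually be made faster still by using itertuples (plain namedtuples, much cheaper than the per-row Series that iterrows builds) and by writing the results into NumPy arrays, assigning the three columns once at the end instead of one .iloc write per row. A sketch of that variant, using the sample frame from the question; the logic mirrors the loop above, but I have not benchmarked it at 100K-customer scale:

```python
import numpy as np
import pandas as pd

# sample frame from the question
dff = pd.DataFrame(index=[1] * 10 + [2] * 10,
                   data={'mth': list(range(1, 11)) * 2,
                         'pmt': [10, 5, 3, 10, 20, 4, 1, 6, 5, 6] * 2,
                         'min_pmt': [5] * 10 + [4] * 10,
                         'Stat': [np.nan] * 20,
                         'up': [np.nan] * 20,
                         'up_cum': [np.nan] * 20})

n = len(dff)
stat = np.zeros(n)        # results accumulate here instead of in the frame
up = np.zeros(n)
up_cum_out = np.zeros(n)
rolling = {}              # customer label -> (running up_cum, flag f)

for r, t in enumerate(dff.itertuples()):
    j = t.Index
    up_cum, f = rolling.get(j, (0, 0))
    if t.pmt < t.min_pmt and f == 0:
        # first month of underpayment
        stat[r] = 1
        up[r] = t.pmt - t.min_pmt
        up_cum += up[r]
        up_cum_out[r] = up_cum
        f = 1
    elif f == 1:
        # still in underpayment: accumulate only further shortfalls
        up[r] = t.pmt - t.min_pmt
        up_cum += min(up[r], 0)
        up_cum_out[r] = up_cum
        stat[r] = np.floor(up_cum / t.min_pmt)
        f = 1 if stat[r] > 0 else 0
    else:
        # row outputs stay zero; the running up_cum carries over
        # unchanged, exactly as in the iterrows loops above
        f = 0
    rolling[j] = (up_cum, f)

# one vectorised write-back instead of one .iloc assignment per row
dff['Stat'], dff['up'], dff['up_cum'] = stat, up, up_cum_out
```

The per-row `.iloc` assignment is typically where most of the remaining time goes, so replacing it with three column assignments at the end tends to help as much as the itertuples switch itself.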

Upvotes: 1
