How to improve the calculation's speed in python?

Question

I'm building a calculation to add a new column to my dataframe. Here is my data:

I need to create a new column "mob". The calculation of "mob" is that

if the "LoanID" of a certain row is the same as the one of the previous row. For example, if loan['LoanId'][0] = loan['LoanId'] 1;
if the "mob" the previous row is >0; if so, then the "mob" value of this row will add 1 from the value of previous row; if not, try if the loan['repay_lbl'] of the row is 1 or 2, if so, the "mob" value of the row will be 1;

My code is as below:

for i in range(1,len(loan['LoanId'])):
if loan['LoanId'][i-1] == loan['LoanId'][i]:
    if loan['mob'][i-1] > 0:
        loan['mob'][i] = loan['mob'][i-1] +1 
    elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
        loan['mob'][i] = 1

The code will cost O(n). Is there any way to improve the algorithm and speed up? I'm just a beginner of Python. I would appreciate so much for your help.

DarrylG · Accepted Answer

Improving Time by Changing Looping Method

Improving loop time based upon

Looping through all N rows without broadcasting, so complexity is O(N)
While all are order N, different looping methods have different complexity scaling factors
The different scaling factors make some methods much faster than others

Inspired by - Different ways to iterate over rows in a Pandas Dataframe — performance comparison

Methods

For loop -- original post
iterrows
itertuples
zip

Summary

The zip method was 93x faster than for loop (i.e. OP method) for 100K rows

Test Code

import pandas as pd
import numpy as np
from random import randint

def create_input(N):
    ' Creates a loan DataFrame with N rows '
    LoanId = [randint(0, N //4) for _ in range(N)]  # though random, N//4 ensures
                                                    # high likelihood some rows repeat
                                                    # LoanID
    repay_lbl = [randint(0, 2) for _ in range(N)]

    data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
    return pd.DataFrame(data)

def m_itertuples(loan):
    ' Iterating using itertuples, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in enumerate(loan.itertuples()): # iterate over rows with iterrows()
        if prev_loanID is not None:
             if prev_loanID == row.LoanId:
                if prev_mob > 0:
                    loan.at[row.Index, 'mob'] = prev_mob + 1 
                elif row.repay_lbl == 1 or row.repay_lbl == 2:
                    loan.at[row.Index, 'mob'] = 1
            
        # Query for latest values   
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
                    
    return loan
    
def m_for_loop(loan):
    ' For loop over the data frame '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
            
    for i in range(1,len(loan['LoanId'])):
        if loan['LoanId'][i-1] == loan['LoanId'][i]:
            if loan['mob'][i-1] > 0:
                loan['mob'][i] = loan['mob'][i-1] +1 
            elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
                loan['mob'][i] = 1
    return loan

def m_iterrows(loan):
    ' Iterating using iterrows, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob = None, None
    for index, row in loan.iterrows(): # iterate over rows with iterrows()
        if prev_loanID is not None:
             if prev_loanID == row['LoanId']:
                if prev_mob > 0:
                    loan.at[index, 'mob'] = prev_mob + 1 
                elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
                    loan.at[index, 'mob'] = 1
                    
        # Query for latest values          
        prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
        
    return loan

def m_zip(loan):
    ' Iterating using zip, set single values using at '
    loan = loan.copy()  # copy since timing calls function multiple time
                        # so don't want to modify input
                        # not necessary in general
    prev_loanID, prev_mob  = None, None
    for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
        if prev_loanID is not None:
             if prev_loanID == loanID:
                if prev_mob > 0:
                    mob = loan.at[index, 'mob'] = prev_mob + 1
                elif repay_lbl == 1 or repay_lbl == 2:
                    mob = loan.at[index, 'mob'] = 1
        
        # Update to latest values
        prev_loanID, prev_mob = loanID, mob
        
    return loan

Note: Iterator code queried dataframe for updated data rather than getting from iterator do to warning:

You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.

Also compared DataFrames using assert df1.equals(df2) to verify the different methods produced identical results

Timing Code

Using benchit

inputs = [create_input(i) for i in 10**np.arange(6)]  # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]

t = benchit.timings(funcs, inputs)

Results

Run time in seconds

Functions  m_for_loop  m_iterrows  m_itertuples     m_zip
Len                                                      
1            0.000217    0.000493      0.000781  0.000327
10           0.001070    0.002002      0.001008  0.000353
100          0.007100    0.016501      0.003062  0.000498
1000         0.056940    0.162423      0.021396  0.001057
10000        0.565809    1.625043      0.210858  0.006938
100000       5.890920   16.658842      2.179602  0.062953

How to improve the calculation's speed in python?

Answers (2)

Related Questions

How to improve the calculation&#39;s speed in python?

Answers (2)

Related Questions

How to improve the calculation's speed in python?