Reputation: 45
I'm building a calculation to add a new column to my dataframe. Here is my data:
I need to create a new column "mob". The calculation of "mob" is that
My code is as below:
for i in range(1,len(loan['LoanId'])):
if loan['LoanId'][i-1] == loan['LoanId'][i]:
if loan['mob'][i-1] > 0:
loan['mob'][i] = loan['mob'][i-1] +1
elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
loan['mob'][i] = 1
The code will cost O(n). Is there any way to improve the algorithm and speed up? I'm just a beginner of Python. I would appreciate so much for your help.
Upvotes: 1
Views: 886
Reputation: 17156
Improving Time by Changing Looping Method
Improving loop time based upon
Inspired by - Different ways to iterate over rows in a Pandas Dataframe — performance comparison
Methods
Summary
The zip method was 93x faster than for loop (i.e. OP method) for 100K rows
Test Code
import pandas as pd
import numpy as np
from random import randint
def create_input(N):
' Creates a loan DataFrame with N rows '
LoanId = [randint(0, N //4) for _ in range(N)] # though random, N//4 ensures
# high likelihood some rows repeat
# LoanID
repay_lbl = [randint(0, 2) for _ in range(N)]
data = {'LoanId':LoanId, 'repay_lbl': repay_lbl, 'mob':[0]*N}
return pd.DataFrame(data)
def m_itertuples(loan):
' Iterating using itertuples, set single values using at '
loan = loan.copy() # copy since timing calls function multiple time
# so don't want to modify input
# not necessary in general
prev_loanID, prev_mob = None, None
for index, row in enumerate(loan.itertuples()): # iterate over rows with iterrows()
if prev_loanID is not None:
if prev_loanID == row.LoanId:
if prev_mob > 0:
loan.at[row.Index, 'mob'] = prev_mob + 1
elif row.repay_lbl == 1 or row.repay_lbl == 2:
loan.at[row.Index, 'mob'] = 1
# Query for latest values
prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
return loan
def m_for_loop(loan):
' For loop over the data frame '
loan = loan.copy() # copy since timing calls function multiple time
# so don't want to modify input
# not necessary in general
for i in range(1,len(loan['LoanId'])):
if loan['LoanId'][i-1] == loan['LoanId'][i]:
if loan['mob'][i-1] > 0:
loan['mob'][i] = loan['mob'][i-1] +1
elif loan['repay_lbl'][i] == 1 or loan['repay_lbl'][i] == 2:
loan['mob'][i] = 1
return loan
def m_iterrows(loan):
' Iterating using iterrows, set single values using at '
loan = loan.copy() # copy since timing calls function multiple time
# so don't want to modify input
# not necessary in general
prev_loanID, prev_mob = None, None
for index, row in loan.iterrows(): # iterate over rows with iterrows()
if prev_loanID is not None:
if prev_loanID == row['LoanId']:
if prev_mob > 0:
loan.at[index, 'mob'] = prev_mob + 1
elif row['repay_lbl'] == 1 or row['repay_lbl'] == 2:
loan.at[index, 'mob'] = 1
# Query for latest values
prev_loanID, prev_mob = loan.at[index, 'LoanId'], loan.at[index, 'mob']
return loan
def m_zip(loan):
' Iterating using zip, set single values using at '
loan = loan.copy() # copy since timing calls function multiple time
# so don't want to modify input
# not necessary in general
prev_loanID, prev_mob = None, None
for index, (loanID, mob, repay_lbl) in enumerate(zip(loan['LoanId'], loan['mob'], loan['repay_lbl'])):
if prev_loanID is not None:
if prev_loanID == loanID:
if prev_mob > 0:
mob = loan.at[index, 'mob'] = prev_mob + 1
elif repay_lbl == 1 or repay_lbl == 2:
mob = loan.at[index, 'mob'] = 1
# Update to latest values
prev_loanID, prev_mob = loanID, mob
return loan
Note: Iterator code queried dataframe for updated data rather than getting from iterator do to warning:
You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
Also compared DataFrames using assert df1.equals(df2)
to verify the different methods produced identical results
Timing Code
Using benchit
inputs = [create_input(i) for i in 10**np.arange(6)] # 1 to 10^5 rows
funcs = [m_for_loop, m_iterrows, m_itertuples, m_zip]
t = benchit.timings(funcs, inputs)
Results
Run time in seconds
Functions m_for_loop m_iterrows m_itertuples m_zip
Len
1 0.000217 0.000493 0.000781 0.000327
10 0.001070 0.002002 0.001008 0.000353
100 0.007100 0.016501 0.003062 0.000498
1000 0.056940 0.162423 0.021396 0.001057
10000 0.565809 1.625043 0.210858 0.006938
100000 5.890920 16.658842 2.179602 0.062953
Upvotes: 1
Reputation: 43495
Since the value of the mob
column for each row depends on that of the previous row, it depends on all previous rows. That means that you can't run this in parallel and you're basically stuck with O(n)
.
So I don't think that numpy array operations are going to be of much use here.
Failing that, there is the usual bag of tricks to speed up Python code;
I'm not sure if the first two work well with numpy/pandas. You might have to use normal Python lists for your data in those cases.
Of course before you dive into any of these, you should consider whether your data set is large enough to warrant the effort.
Upvotes: 1