Python - How to optimize code to run faster? (lots of for loops in DataFrame)

I have code that works quite extensively with an Excel file (an SAP download): data transformation and calculation steps.
I need to loop through all the lines (a couple thousand rows) a few times. I previously wrote code that added the DataFrame columns separately, so I could do everything in one for loop, which was quite quick; however, I had to change the data source, which meant a change in the raw data structure.
In the raw data, the first 3 rows are empty, then comes a title row with the column names, then 2 more empty rows, and the first column is also empty. I decided to wipe these and assign the column names as headers (steps below). However, since then, separately adding the column names and later calculating everything in one for statement no longer fills data into any of these specific columns.
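For illustration, the raw layout looks something like this (column names and values made up). I'm simulating it with read_csv here, since pd.read_excel accepts the same header= argument, so the leading empty rows can also be skipped at read time:

```python
import io
import pandas as pd

# Simulated raw layout: 3 empty rows, a title row, 2 empty rows,
# then data; the first column is always empty.
raw = "\n".join([
    ",,,",
    ",,,",
    ",,,",
    ",Orders,Time,Entry Date",
    ",,,",
    ",,,",
    ",1001,04:30:00,2020-09-01",
    ",1002,08:00:00,2020-09-01",
])

# header=3 uses the 4th line as the header, skipping the empty rows above it
df = pd.read_csv(io.StringIO(raw), header=3)
df = df.dropna(subset=["Orders"]).drop(columns=df.columns[0])
print(list(df.columns))  # ['Orders', 'Time', 'Entry Date']
```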
How could I optimize this code?

I have deleted some calculation steps, since they are quite long and would make this part of the code even less readable.

import pandas as pd
from datetime import time, timedelta

#This function adds new columns to the dataframe
def NewColdfConverter(*args):
    for i in args:
        dfConverter[i] = '' #previously used dfConverter[i] = NaN


#This function creates dataframe from excel file  
def DataFrameCreator(path,sheetname):  
    excelFile = pd.ExcelFile(path)  
    global readExcel  
    readExcel = pd.read_excel(excelFile,sheet_name=sheetname)  



#calling my function to create dataframe  
DataFrameCreator(filePath,sheetName)  
dfConverter = pd.DataFrame(readExcel)  
  
#dropping NA values from Orders column (right now called Unnamed)  
dfConverter.dropna(subset=['Unnamed: 1'], inplace=True)  
  
#dropping rows and deleting other unnecessary columns  
dfConverter.drop(dfConverter.head(1).index, inplace=True)  
dfConverter.drop(dfConverter.columns[[0,11,12,13,17,22,23,48]], axis = 1,inplace = True)  
  
#renaming columns from Unnamed 1: etc to proper names  
dfConverter = dfConverter.rename(columns={'Unnamed: 1': 'propername1', 'Unnamed: 2': 'propername2'})  # etc.
  
#calling new column function -> this Day column appears in the 1st for loop  
NewColdfConverter("Day")  

  
#example for loop that worked prior, but not working since new dataset and new header/column steps added:  
for i in range(len(dfConverter)):      
    #Day column-> floor Entry Date -1, if time is less than 5:00:00
    if(dfConverter['Time'][i] <= time(hour=5,minute=0,second=0)):  
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])-timedelta(days=1)  
    else:  
        dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])  

The problem is that many columns build on one another, so I cannot get them in one for loop. For instance, in the example below I need to calculate reqsWoSetUpValue before I can calculate requirementsValue, and that before otherReqsValue. But I'm not able to do this within one for loop by assigning the values to the dataframecolumn[i] row, because the value will just be missing, as if nothing happened.
(dfSorted is the same as dfConverter, but a sorted version of it)

#example code of getting reqsWoSetUpValue  
for i in range(len(dfSorted)):  
    reqsWoSetUpValue[i] = #calculationsteps...  
#inserting column with value  
dfSorted.insert(49,'Reqs wo SetUp',reqsWoSetUpValue)  
  
#getting requirements value with previously calculated Reqs wo SetUp column  
for i in range(len(dfSorted)):  
    requirementsValue[i] = #calc  
  
dfSorted.insert(50,'Requirements',requirementsValue)  
  
#Calculating Other Reqs value with previously calculated Requirements column.  
for i in range(len(dfSorted)):  
    otherReqsValue[i] = #calc  
  
dfSorted.insert(51,'Other Reqs',otherReqsValue)  
  

Does anyone have a clue why I cannot do this in one for loop anymore by first adding all the columns with the function, like:

NewColdfConverter('Reqs wo setup','Requirements','Other reqs')  
  
#then in 1 for loop:  
for i in range(len(dfSorted)):
    dfSorted['Reqs wo setup'][i] = #calculationsteps
    dfSorted['Requirements'][i] = #calculationsteps
    dfSorted['Other reqs'][i] = #calculationsteps
  

Thank you

Upvotes: 1

Views: 139

Answers (2)

JuliettVictor

Reputation: 644

General comment: How to identify bottlenecks

To get started, you should try to identify which parts of the code are slow.

Method 1: time code sections using the time module

Wrap blocks of code in statements like this:

import time
t = time.time()

# do something

print("time elapsed: {:.1f} seconds".format(time.time() - t))

Method 2: use a profiler

E.g. Spyder has a built-in profiler. This allows you to check which operations are most time consuming.
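If you're not using Spyder, Python's built-in cProfile module works anywhere; a minimal sketch (the profiled function is just a stand-in):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Stand-in for a slow code section."""
    total = 0
    for i in range(n):
        total += i
    return total

# Profile just the section of interest
pr = cProfile.Profile()
pr.enable()
slow_sum(100_000)
pr.disable()

# Print the 5 most time-consuming calls
out = io.StringIO()
pstats.Stats(pr, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue())
```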

Vectorize your operations

Your code will be orders of magnitude faster if you vectorize your operations. It looks like your loops are all avoidable.

For example, rather than calling pd.to_datetime on every row separately, you should call it on the entire column at once:

# slow (don't do this):
for i in range(len(dfConverter)):
   dfConverter['Day'][i] = pd.to_datetime(dfConverter['Entry Date'][i])

# fast (do this instead):
dfConverter['Day'] = pd.to_datetime(dfConverter['Entry Date'])

If you want to perform an operation on a subset of rows, you can also do this in a vectorized operation by using loc:

mask = dfConverter['Time'] <= time(hour=5,minute=0,second=0)
dfConverter.loc[mask,'Day'] = pd.to_datetime(dfConverter.loc[mask,'Entry Date']) - timedelta(days=1)
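Putting both branches together (with made-up sample data), the whole Day column can be computed without any loop, e.g. with Series.where:

```python
import pandas as pd
from datetime import time, timedelta

# Hypothetical sample data matching the question's columns
df = pd.DataFrame({
    "Time": [time(4, 30), time(8, 0)],
    "Entry Date": ["2020-09-01", "2020-09-01"],
})

entry = pd.to_datetime(df["Entry Date"])
mask = df["Time"] <= time(hour=5, minute=0, second=0)
# Keep the entry date where the mask is False, shift back a day where it is True
df["Day"] = entry.where(~mask, entry - timedelta(days=1))
print(df["Day"].dt.strftime("%Y-%m-%d").tolist())  # ['2020-08-31', '2020-09-01']
```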

Upvotes: 1

Moarram

Reputation: 66

Not sure this would improve performance, but you could calculate the dependent columns at the same time, row by row, with DataFrame.iterrows():

for index, data in dfSorted.iterrows():
    # .at writes reliably, unlike chained dfSorted[col][index] assignment
    dfSorted.at[index, 'Reqs wo setup'] = #calculationsteps
    dfSorted.at[index, 'Requirements'] = #calculationsteps
    dfSorted.at[index, 'Other reqs'] = #calculationsteps
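A hypothetical sketch of that pattern, using .at (which writes reliably where chained df[col][index] assignment may not) with invented data and calculations:

```python
import pandas as pd

# Invented data and calculations, just to show dependent columns
# filled row by row in a single pass
df = pd.DataFrame({"base": [10, 20, 30]})
for col in ("Reqs wo setup", "Requirements", "Other reqs"):
    df[col] = 0.0

for index, row in df.iterrows():
    reqs = row["base"] * 2                        # placeholder calculation
    df.at[index, "Reqs wo setup"] = reqs
    df.at[index, "Requirements"] = reqs + 1       # depends on the previous column
    df.at[index, "Other reqs"] = df.at[index, "Requirements"] / 2

print(df["Other reqs"].tolist())  # [10.5, 20.5, 30.5]
```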

Upvotes: 0
