Mow
Mow

Reputation: 13

Perform calculation on multiple dataframe columns using a list of values without iteration

I am new to Python (using version 3.7). I have a dataframe that i create by loading a list from a csv file. I would like to update a column in the dataframe ("Score") that will hold the result of the summation of calculations performed on specific column values in the dataframe. Here is a snippet of the code:

#load library
import pandas as pd
#get the data
file_name = "c:\myfile.csv"
df = pd.read_csv(file_name)
#get the variable parameters
sVariableList = ["depth","rpm","pressure","flow_rate","lag" ]
sWeights = [.20, .20, .30, .15, .15] 
sMeans = [57.33283924063220, 7159.6003409761900, 20.270635083327700, 55.102824912342000, 90.67]
sSTD  = [101.803564244615000 , 3124.14373264349000, 32.461940805541400, 93.338695138920900, 61.273]

The dataframe contains more columns than the items listed in sVariableList. sVariable list represents just the fields that i want to perform the calculation on. What i would like to do is to calculate a score per each row - storing the value in the column "Score". Here is what i am doing right now and it gives the correct results:

#loop through the record and perform the calculation
for row in range(len(df)):
    ind = 0
    nScore = 0.0
    for fieldname in sVariableList: 

        #calculate the score
        nScore = nScore + ( sWeights[ind]*(((df.loc[row, fieldname] - sVariableMeans[ind])/sSTD[ind])**2) )
        ind = ind + 1 #move to the next variable/field index

    #set the result to the field value
    df.loc[row, "Score"] = nScore

But is very slow. I have a dataset of 900,000 records.

I found articles discussing list compression as a possible alternative to iteration but i am not familiar enough with the language to implement. Any ideas are appreciated.

thanks

Upvotes: 0

Views: 561

Answers (1)

Stef
Stef

Reputation: 30579

Do the calculation on the underlying numpy data and only assign the final result to the dataframe:

x = np.array([sWeights, sMeans, sSTD])
y = df[sVariableList].to_numpy()
df['Score'] = (x[0] * ((y - x[1]) / x[2])**2).sum(axis=1)

For 900,000 records this takes about 0.15 s on my computer.

Upvotes: 1

Related Questions