Reputation: 1106
I am trying to do a calculation on a DataFrame with 791 rows and 130 columns.
The code below shows a function that takes a DataFrame of 791 chemical compounds; each element column contains the fraction of that element in the given compound. For example:
Material ... Zr ... Si ... O ...
SiO2 ... 0.0 ... 0.33 ... 0.66 ...
There are 791 compounds, and the following function takes in the whole DataFrame, looks up each element's atomic weight in a CSV file, and sums the contributions:
def add_molar_mass(DF):
    # el is a dataframe with a list of 40 elements and their atomic masses
    el = pd.read_csv(path)
    # create an empty column "Molar Mass"
    DF["Molar Mass"] = np.nan
    for j in range(len(DF)):
        val = 0
        for i in range(len(el)):
            # el and DF share 41 columns.
            # el['symbol'].iloc[i] gives the name of the i-th element, say "Ca";
            # DF[el['symbol'].iloc[i]].iloc[j] gives the fraction of Ca for the
            # j-th member of DF
            val += DF[el['symbol'].iloc[i]].iloc[j] * el['atomicMass'].iloc[i]
        DF["Molar Mass"].iloc[j] = round(val, 3)
However, this is painfully slow: each row of DF takes about a second to process (I timed it using time()).
Is there something that's obviously slowing this code down, and if so, can anyone suggest some modifications?
UPDATE:
Thank you for all the responses:
The new code snippet:
def add_molar_mass(DF):
    # el is a dataframe with a list of 40 elements and their atomic masses
    el = pd.read_csv(path)
    masses = np.array(el['atomicMass'], dtype=float)
    DF["Molar Mass"] = (DF[el['No']] * masses).sum(axis=1)
This executes in a few milliseconds. Definitely an improvement!
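The speedup comes from replacing the Python-level double loop with one column selection, one broadcast multiply, and one row-wise sum. A minimal sketch with made-up fractions (the column names and the three-element table are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# two compounds, three element-fraction columns (made-up values)
DF = pd.DataFrame({'Si': [0.33, 0.0], 'O': [0.66, 0.5], 'H': [0.0, 0.5]})
symbols = ['Si', 'O', 'H']                  # order defines the alignment
masses = np.array([28.086, 15.999, 1.008])  # must match `symbols` order

# DF[symbols] selects the columns in the same order as `masses`,
# so the broadcast multiply pairs each fraction with the right mass
DF['Molar Mass'] = (DF[symbols] * masses).sum(axis=1)
```

The one thing to watch is alignment: the array of masses must be in the same order as the columns selected from the DataFrame, which is why selecting with the element table's own symbol column is safest.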
Upvotes: 1
Views: 395
Reputation: 20768
Because you're iterating over each row, which is the slowest possible way you can do this. That's not how Pandas is meant to be used.
Check this blog post on how to optimize, likely by 10x to 100x:
https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
Summary of the post in case it goes away, iteration approaches from slowest to fastest:
1. A for loop with value lookup via .loc or .iloc is the slowest possible way.
2. .iterrows() is a bit better.
3. .itertuples() is faster than .iterrows().
4. .apply() with a lambda function is faster still.
5. Operating on a pd.Series is faster (i.e. pass the series to the function, as opposed to applying the function row by row).
6. pd.Series(df.col.values) is fastest, as .values is an np.array, which gives the best performance.
Upvotes: 3
Reputation: 8550
In general, I think you'll do best to think about how you can exploit broadcast operations rather than using explicit loops. This generally results in cleaner, easier-to-read code and will likely be faster (not guaranteed; it depends).
I would get the symbol and atomicMass columns out of the el frame before you start, then broadcast. The following is an example with made-up data matching the shape of your data (pretty sure this is what you are trying to do anyway):
import numpy as np
import pandas as pd

columns = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']
compounds = pd.DataFrame(np.random.random((20, 10)), columns=columns)

# made-up element table: 5 random symbols with random "atomic masses"
el = pd.DataFrame(
    np.vstack((np.random.choice(columns, 5, replace=False),
               np.random.random(5))).transpose(),
    columns=['symbol', 'atomicMass']
)

masses = np.array(el['atomicMass'], dtype=float)
compounds['Molar Mass'] = (compounds[el['symbol']] * masses).sum(axis=1)
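As a quick sanity check (same made-up-data setup as above), the broadcast result matches the explicit double loop from the question:

```python
import numpy as np
import pandas as pd

columns = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']
compounds = pd.DataFrame(np.random.random((20, 10)), columns=columns)
el = pd.DataFrame({
    'symbol': np.random.choice(columns, 5, replace=False),
    'atomicMass': np.random.random(5),
})

masses = np.array(el['atomicMass'], dtype=float)
broadcast = (compounds[el['symbol']] * masses).sum(axis=1)

# explicit nested loop, as in the original question
loop = []
for j in range(len(compounds)):
    val = 0.0
    for i in range(len(el)):
        val += compounds[el['symbol'].iloc[i]].iloc[j] * el['atomicMass'].iloc[i]
    loop.append(val)
```

Both give the same 20 molar-mass values, but the broadcast version does it in a handful of vectorized operations instead of 20 x 5 scalar lookups.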
Upvotes: 1