Reputation: 1106
I am trying to do a calculation on a DataFrame with 791 rows and 130 columns.
The code below shows a function that takes a DataFrame of 791 chemical compounds; each element column contains the fraction of that element in the given compound. For example:
Material ... Zr ... Si ... O ...
SiO2 ... 0.0 ... 0.33 ... 0.66 ...
There are 791 compounds, and the following function takes in the whole DataFrame, looks up each element's atomic weight in a CSV file, and sums the contributions:
def add_molar_mass(DF):
    # el is a dataframe with a list of 40 elements and their atomic masses
    el = pd.read_csv(path)
    # create an empty column "Molar Mass"
    DF["Molar Mass"] = np.nan
    for j in range(len(DF)):
        val = 0
        for i in range(len(el)):
            # el and DF share 41 columns.
            # el['symbol'].iloc[i] gives the name of the i-th element, say "Ca";
            # DF[el['symbol'].iloc[i]].iloc[j] gives the fraction of Ca for the
            # j-th member of DF
            val += DF[el['symbol'].iloc[i]].iloc[j] * el['atomicMass'].iloc[i]
        DF["Molar Mass"].iloc[j] = round(val, 3)
However, this is painfully slow: each row of DF takes about a second to process (I timed it using time()).
Is there something that's obviously slowing this code down, and if so, can anyone suggest some modifications?
UPDATE:
Thank you for all the responses:
The new code snippet:
def add_molar_mass(DF):
    # el is a dataframe with a list of 40 elements and their atomic masses
    el = pd.read_csv(path)
    masses = np.array(el['atomicMass'], dtype=float)
    DF["Molar Mass"] = (DF[el['No']] * masses).sum(axis=1)
This executes in a few milliseconds. Definitely an improvement!
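The speedup comes from replacing the Python-level double loop with one column selection, one broadcast multiply, and one row-wise sum. A minimal sketch with made-up fractions (the column names and the three-element table are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# two compounds, three element-fraction columns (made-up values)
DF = pd.DataFrame({'Si': [0.33, 0.0], 'O': [0.66, 0.5], 'H': [0.0, 0.5]})
symbols = ['Si', 'O', 'H']                  # order defines the alignment
masses = np.array([28.086, 15.999, 1.008])  # must match `symbols` order

# DF[symbols] selects the columns in the same order as `masses`,
# so the broadcast multiply pairs each fraction with the right mass
DF['Molar Mass'] = (DF[symbols] * masses).sum(axis=1)
```

The one thing to watch is alignment: the array of masses must be in the same order as the columns selected from the DataFrame, which is why selecting with the element table's own symbol column is safest.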
Upvotes: 1
Views: 395
Reputation: 20768
Because you're iterating over each row, which is the slowest possible way you can do this. That's not how Pandas is meant to be used.
Check this blog post on how to optimize, likely by 10x to 100x:
https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
Summary of the post in case it goes away, iteration approaches from slowest to fastest:
1. A for loop with value lookup via .loc or .iloc is the slowest possible way.
2. .iterrows() is a bit better.
3. .itertuples() is faster than .iterrows().
4. .apply() with a lambda function is faster still.
5. Operating on a pd.Series is faster (i.e. pass the series to the function, as opposed to applying the function row by row).
6. pd.Series(df.col.values) is fastest, as .values is an np.array, which gives the best performance.
Upvotes: 3
Reputation: 8550
In general, I think you'll do best to think about how you can exploit broadcast operations rather than using explicit loops. This generally results in cleaner, easier-to-read code and will likely be faster (not guaranteed; it depends).
I would get the symbol and atomicMass columns out of the el frame before you start, then broadcast. The following is an example with made-up data matching the shape of your data (pretty sure this is what you are trying to do anyway):
import numpy as np
import pandas as pd

columns = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']
compounds = pd.DataFrame(np.random.random((20, 10)), columns=columns)

# made-up element table: 5 random symbols with random "atomic masses"
el = pd.DataFrame(
    np.vstack((np.random.choice(columns, 5, replace=False),
               np.random.random(5))).transpose(),
    columns=['symbol', 'atomicMass']
)

masses = np.array(el['atomicMass'], dtype=float)
compounds['Molar Mass'] = (compounds[el['symbol']] * masses).sum(axis=1)
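As a quick sanity check (same made-up-data setup as above), the broadcast result matches the explicit double loop from the question:

```python
import numpy as np
import pandas as pd

columns = ['H', 'He', 'Li', 'Be', 'B', 'C', 'N', 'O', 'F', 'Ne']
compounds = pd.DataFrame(np.random.random((20, 10)), columns=columns)
el = pd.DataFrame({
    'symbol': np.random.choice(columns, 5, replace=False),
    'atomicMass': np.random.random(5),
})

masses = np.array(el['atomicMass'], dtype=float)
broadcast = (compounds[el['symbol']] * masses).sum(axis=1)

# explicit nested loop, as in the original question
loop = []
for j in range(len(compounds)):
    val = 0.0
    for i in range(len(el)):
        val += compounds[el['symbol'].iloc[i]].iloc[j] * el['atomicMass'].iloc[i]
    loop.append(val)
```

Both give the same 20 molar-mass values, but the broadcast version does it in a handful of vectorized operations instead of 20 x 5 scalar lookups.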
Upvotes: 1