pakpe

Reputation: 5479

Performance difference between applying lambda vs. named function to a pandas DataFrame

I set out to determine whether lambda expressions are slower or faster than named functions when applied to a pandas DataFrame, not expecting to find much of a difference. To my surprise, the lambda approach was substantially slower in the following example. Is this consistently true? If so, why?

import pandas as pd
import numpy as np
import cmath

df = pd.DataFrame(np.random.randint(1,100,(100000,3)), columns=['col1','col2', 'col3'])

#Named function approach:
def quad_roots(row):
    '''Function that calculates roots of a quadratic equation'''
    a = row['col1']; b = row['col2']; c = row['col3']
    dis = (b ** 2) - (4 * a * c)
    root1 = (-b - cmath.sqrt(dis)) / (2 * a)
    root2 = (-b + cmath.sqrt(dis)) / (2 * a)
    return np.round(root1,2), np.round(root2,2)
df['roots_named'] = df.apply(quad_roots, axis=1)

#Lambda approach
df['roots_lambda'] = df.apply(lambda x: ((np.round((-x['col2'] - cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) ),
                                   (np.round((-x['col2'] + cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) ))
                                   , axis=1)
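
For reference, one way to time each approach (assuming an IPython/Jupyter session for the %%timeit cell magic; exact numbers will vary by machine) is:

%%timeit
df.apply(quad_roots, axis=1)

and likewise for the lambda version in a separate cell.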

Upvotes: 3

Views: 2942

Answers (1)

Polkaguy6000

Reputation: 1208

In general, they are equivalent; whatever difference exists between calling a lambda and calling an identical named function is negligible.

In either case, the difference here is not caused by lambda versus a named function, but by how the expression itself is structured. To show this, we can define a named function with exactly the same body as the lambda and compare the timings.

%%timeit
df.apply(lambda x: ((np.round((-x['col2'] - cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) ),
                               (np.round((-x['col2'] + cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) ))
                               , axis=1)

returns

 7.78 s ± 14.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Meanwhile:

 %%timeit
 def consolidated_func(x):
     return ((np.round((-x['col2'] - cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) ),
            (np.round((-x['col2'] + cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))) / (2 * x['col1']),2) )
          )
 df.apply(consolidated_func,axis=1)

returns

 7.79 s ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

roughly the same performance.
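
As a side check (a toy pair of functions of my own, separate from the timings above): the dis module shows that a lambda and an equivalent def compile to the same bytecode, so the keyword itself is not what costs time.

import dis

def add_one(x):
    return x + 1

add_one_lambda = lambda x: x + 1

# Both disassemble to the same instruction sequence: load x, load 1, add, return.
dis.dis(add_one)
dis.dis(add_one_lambda)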

The difference in the code boils down to how the work is laid out per row. df.apply is not vectorized, so the function body is evaluated once for every row, and any redundant work is repeated 100,000 times. The one-liner recomputes the discriminant inline for each root and keeps switching between np.round and cmath.sqrt inside a single expression, whereas the original quad_roots computes the discriminant once; that extra per-row work is what shows up in the timings.
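
To illustrate the point (a restructuring of my own; I have not re-timed it here): if the lambda is restructured so the square root is computed only once per row, much as quad_roots computes the discriminant once, the redundant inline work disappears.

# Same calculation as a lambda, but structured like quad_roots:
# an inner lambda receives the square root once, then both roots are formed from it.
df['roots_lambda_restructured'] = df.apply(
    lambda x: (lambda sq: (np.round((-x['col2'] - sq) / (2 * x['col1']), 2),
                           np.round((-x['col2'] + sq) / (2 * x['col1']), 2)))
              (cmath.sqrt((x['col2'] ** 2) - (4 * x['col1'] * x['col3']))),
    axis=1)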

To really improve things, we could vectorize most of this function. (I did not vectorize cmath.sqrt. It might be possible to find an equivalent function, but I'm assuming you used it for a reason, so I'm leaving it alone.)

 df['dis'] = (df['col2'] ** 2) - (4 * df['col1'] * df['col3'])
 df['root1'] = np.round((-df['col2'] - df['dis'].apply(cmath.sqrt)) / (2 * df['col1']).values,2)
 df['root2'] = np.round((-df['col2'] + df['dis'].apply(cmath.sqrt)) / (2 * df['col1']).values,2)
 
 df['roots_vectorized'] = df[['root1','root2']].apply(tuple,axis=1) # There is probably a better way to do this (see the note below).

yields:

 792 ms ± 6.58 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

That is roughly a tenfold reduction in execution time.
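
As for building the tuple column, one alternative to the row-wise apply (untimed here, just a suggestion) is to zip the two root columns directly:

# Builds the same column of (root1, root2) tuples without a per-row apply.
df['roots_vectorized'] = list(zip(df['root1'], df['root2']))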

Upvotes: 6
