Working of pandas.apply() with functions

Question

I need to add a few calculated columns to a pandas DataFrame. Some of these columns require the values to be passed to specific functions.

I came across some behavior that I did not understand. Consider the following code snippet:

from numpy.random import randn
from pandas import DataFrame

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

df['reg_sum'] = df.a + df.b
# works almost instantly

df['f_sum'] = df.apply(lambda x: just_sum(x.a, x.b), axis=1)
# takes a little more than 30 seconds

Why is the apply method taking so much time?
Is this the right way to do this? If not, what is?

PS: Somebody suggested using Cython. Will that really affect performance?

abhinav pandey · Accepted Answer

Answering the question, as there were two parts to it.

As @Orenshi said, the apply function doesn't take advantage of vectorization: with axis=1 it calls the function once per row. The right way to do this is to vectorize the function. The snippet in the question can thus be written as:

from numpy.random import randn
from numpy import vectorize
from pandas import DataFrame

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

# wrap just_sum so it accepts whole columns instead of single values
vector_sum = vectorize(just_sum)

df['f_sum'] = vector_sum(df.a, df.b)
# works almost instantly
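To check these timings on your own machine, a rough sketch like the one below may help. It compares the row-wise apply, numpy.vectorize, and calling just_sum directly on the two columns, which works here only because the function body is a plain `+`. The `timed` helper and the smaller frame of 100,000 rows are assumptions made for illustration and are not part of the original answer. Note that the NumPy documentation describes numpy.vectorize as a convenience wrapper whose implementation is essentially a for loop, so the gap between it and the direct column call is worth measuring rather than assuming.

```python
import time

import numpy as np
import pandas as pd


def just_sum(a, b):
    return a + b


# Assumption: a smaller frame than in the question, so the row-wise apply
# finishes in a few seconds instead of tens of seconds.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100_000, 2)), columns=list("ab"))


def timed(label, func):
    """Hypothetical helper: run func once, print elapsed time, return its result."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.4f} s")
    return result


# Row-wise apply: just_sum is called once per row from Python,
# and each row is first packed into a Series.
by_apply = timed(
    "apply(axis=1)",
    lambda: df.apply(lambda x: just_sum(x.a, x.b), axis=1),
)

# numpy.vectorize: still one Python call per element pair,
# but without building a Series for every row.
by_vectorize = timed(
    "numpy.vectorize",
    lambda: np.vectorize(just_sum)(df.a, df.b),
)

# Whole columns passed directly: "+" on two Series is one array operation.
by_columns = timed(
    "just_sum(df.a, df.b)",
    lambda: just_sum(df.a, df.b),
)

# All three approaches should produce the same values.
assert np.allclose(by_apply, by_vectorize)
assert np.allclose(by_apply, by_columns)
```

Calling the function directly on the columns is only possible when every operation inside it is itself defined for whole Series or arrays. For functions with Python-level branching on individual values, numpy.vectorize remains a convenient drop-in, as in the answer above.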
