Reputation: 584
I need to add a few calculated columns to a pandas DataFrame. Some of these columns require the values to be passed to specific functions.
I came across some behavior that I did not understand. Consider the following code snippet:
from numpy.random import randn
from pandas import DataFrame

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

df['reg_sum'] = df.a + df.b
# works almost instantly

df['f_sum'] = df.apply(lambda x: just_sum(x.a, x.b), axis=1)
# takes a little more than 30 seconds
PS: Somebody suggested using Cython. Will that really affect performance?
Upvotes: 0
Views: 149
Reputation: 584
Answering the question, as there were two parts to it.
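As @Orenshi said, the apply function doesn't take advantage of vectorization. A much faster way to do this is to wrap the function with numpy.vectorize and pass it whole columns. Note that NumPy documents vectorize as a convenience that is essentially a for loop, so the gain here comes mostly from avoiding the per-row overhead of apply rather than from true vectorization. The snippet in the question can thus be written as: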
from numpy.random import randn
from numpy import vectorize
from pandas import DataFrame

def just_sum(a, b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

# wrap the scalar function so it accepts whole columns
vector_sum = vectorize(just_sum)
df['f_sum'] = vector_sum(df.a, df.b)
# works almost instantly
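To make the difference concrete, here is a rough timing sketch that runs the same just_sum three ways: directly on the columns, through numpy.vectorize, and through apply with axis=1. The row count and the timed helper are my own choices for illustration, not part of the original question, and the actual timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd


def just_sum(a, b):
    return a + b


# Smaller than the 1,000,000 rows in the question so the slow path
# finishes quickly; this size is an assumption for illustration.
n_rows = 50_000
df = pd.DataFrame(np.random.randn(n_rows, 2), columns=list("ab"))


def timed(label, func):
    """Hypothetical helper: run func once and print the elapsed time."""
    start = time.perf_counter()
    result = func()
    print(f"{label}: {time.perf_counter() - start:.3f} s")
    return result


# 1. Pass whole columns to the function; `+` works element-wise on Series.
direct = timed("column arithmetic", lambda: just_sum(df["a"], df["b"]))

# 2. Wrap the function with numpy.vectorize and pass whole columns.
wrapped = timed("numpy.vectorize", lambda: np.vectorize(just_sum)(df["a"], df["b"]))

# 3. Call the function once per row through apply.
by_row = timed("apply(axis=1)", lambda: df.apply(lambda x: just_sum(x.a, x.b), axis=1))
```

Since just_sum only uses `+`, it can take the two columns directly without any wrapper. numpy.vectorize is mainly useful when the function contains logic that only works on scalars.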
Upvotes: 0
Reputation: 1873
The apply function doesn't take advantage of vectorization. With axis=1, every call to the function receives a brand new Series built from one row, so for millions of rows that is a lot of object-creation overhead.
See the discussion in the pandas GitHub issue: Pandas Issue 11615
The accepted answer to this other Stack Overflow post mentions it as well:
Pandas - Explanation on apply function being slow
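A small sketch of what that means in practice: the function passed to apply with axis=1 is handed a Series for each row. The tiny frame and the record_and_sum helper below are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame; the values are arbitrary.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

seen_types = []


def record_and_sum(row):
    # Record what kind of object apply passes in for each row.
    seen_types.append(type(row).__name__)
    return row["a"] + row["b"]


result = df.apply(record_and_sum, axis=1)

print(set(seen_types))  # the distinct types seen across all calls
print(result.tolist())
```

Each row arrives as its own Series object, which is cheap once but adds up over a million rows.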
Upvotes: 2