abhinav pandey

Reputation: 584

Working of pandas.apply() with functions

I need to add a few calculated columns to a pandas DataFrame. Some of these columns require their values to be passed through specific functions.

I came across some behavior that I do not understand in the following code snippet:

from numpy.random import randn
from pandas import DataFrame

def just_sum(a,b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

df['reg_sum'] = df.a + df.b
#works almost instantly

df['f_sum'] = df.apply(lambda x: just_sum(x.a, x.b), axis=1)
# takes a little more than 30 seconds

  1. Why is the apply method taking so much time?
  2. Is this the right way to do this? If not, what is?

PS: Somebody suggested using Cython. Would that really improve performance?

Upvotes: 0

Views: 149

Answers (2)

abhinav pandey

Reputation: 584

Answering the question as there were 2 parts to it.

As @Orenshi said, the apply function doesn't take advantage of vectorization. The right way to do this is to vectorize the function. The snippet in the question can thus be written as:

from numpy.random import randn
from numpy import vectorize
from pandas import DataFrame

def just_sum(a,b):
    return a + b

# 1,000,000 rows of random data in two columns
df = DataFrame(randn(1000000, 2), columns=list('ab'))

vector_sum = vectorize(just_sum)

df['f_sum'] = vector_sum(df.a, df.b)
#works almost instantly
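
Because just_sum only uses the + operator, it can also be called directly on the two columns, since a pandas Series supports + element-wise. Here is a minimal sketch of both forms, assuming a smaller 1,000-row frame and the hypothetical column names direct_sum and vec_sum. NumPy's documentation notes that vectorize is provided primarily for convenience and is essentially a for loop, so the direct call is typically the faster of the two on large frames.

```python
import numpy as np
import pandas as pd

def just_sum(a, b):
    return a + b

# Smaller frame than in the question, purely to keep the sketch quick
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((1000, 2)), columns=list("ab"))

# Direct call: + is applied to whole columns at once
df["direct_sum"] = just_sum(df.a, df.b)

# np.vectorize wrapper: same result, but just_sum is called once per element
df["vec_sum"] = np.vectorize(just_sum)(df.a, df.b)
```

The direct call only works when every operation inside the function is itself defined for whole Series or arrays; a function with scalar-only logic (for example, an if on a single value) would still need np.vectorize or a rewrite.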

Upvotes: 0

Orenshi

Reputation: 1873

The apply function doesn't take advantage of vectorization. With axis=1, every call to the function receives a brand new Series built for that row, so for millions of rows that is a lot of overhead.
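
A rough way to see the difference is to time both approaches on a smaller frame. This is only a sketch: the 10,000-row size and the variable names are assumptions, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

def just_sum(a, b):
    return a + b

# 10,000 rows instead of 1,000,000 so the row-wise version finishes quickly
df = pd.DataFrame(np.random.randn(10_000, 2), columns=list("ab"))

# Row-wise: the lambda is called once per row, each time with a Series
start = time.perf_counter()
row_wise = df.apply(lambda row: just_sum(row.a, row.b), axis=1)
apply_seconds = time.perf_counter() - start

# Column-wise: a single call that adds the two columns as whole arrays
start = time.perf_counter()
column_wise = just_sum(df.a, df.b)
vector_seconds = time.perf_counter() - start

print(f"apply(axis=1): {apply_seconds:.4f}s, column arithmetic: {vector_seconds:.4f}s")
```

Both versions produce the same values; only the number of Python-level function calls differs.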

Check out the discussion in this GitHub issue: Pandas Issue 11615

The accepted answer in this other Stack Overflow post mentions it as well:

Pandas - Explanation on apply function being slow

Upvotes: 2
