python pandas

Max Payne

Reputation: 389

What is the difference between using .apply and passing a column of a DataFrame?

I am looking to apply a simple function to a column in a pandas DataFrame. I have done it in two different ways:

1. df['column1'] = myFunction(df['column1'])
2. df['column1'] = df['column1'].apply(lambda x: myFunction(x))

My dataset is not big enough for me to tell the difference, but I am guessing it has to do with speed.

Can anyone explain what the difference is and which one is preferred?

Upvotes: 4

Answers (1)

Julien Marrec

Reputation: 11895

1. df['column1'] = myFunction(df['column1'])

Here you are calling the function once on the whole pd.Series. You're letting pandas handle how the operation is carried out.

2. df['column1'] = df['column1'].apply(lambda x: myFunction(x))

Here you are applying the function to each element, one at a time.

In general, option 1 will be faster than option 2, but it depends heavily on what your actual myFunction does: whether it is vectorized or works element by element.
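To make the distinction concrete, here is a minimal sketch that records what the function actually receives in each case. The helper `record_type` and the three-row frame are made up for illustration and are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1.0, 2.0, 3.0]})

seen_types = []

def record_type(x):
    # Hypothetical helper: remembers the type of its argument, then squares it.
    seen_types.append(type(x).__name__)
    return x ** 2

# Option 1: the function is called once, with the whole Series.
whole = record_type(df["column1"])
calls_option_1 = list(seen_types)

seen_types.clear()

# Option 2: .apply calls the function once per element, with a scalar each time.
per_element = df["column1"].apply(record_type)
calls_option_2 = list(seen_types)

print(calls_option_1)       # a single entry naming the Series type
print(len(calls_option_2))  # one entry per row
```

Both options produce the same values here; what differs is how many times Python has to call the function.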

Case example:

Let's create a dataframe with 2 columns and 100,000 rows (big enough to appreciate the difference in speed), and square the elements of column1:

In [1]: 
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100000,2),
                  columns=['column1','column2'])

def myFunction(s):
    return s**2

In [2]: %%timeit
    ...: myFunction(df.column1)
    ...: 
1000 loops, best of 3: 1.68 ms per loop

In [3]: %%timeit
    ...: df.column1.apply(lambda x: x**2)
    ...: 
10 loops, best of 3: 55.4 ms per loop

So here you see it's more than 30 times faster to do the operation on the whole pd.Series rather than element by element. That's because myFunction is vectorized: s**2 is computed for the entire column in one call, instead of calling a Python function once per row.

Now, let's take an example where your myFunction is not vectorized but works element by element:
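If you are not working in IPython, a rough equivalent of the comparison can be sketched with the standard-library `timeit` module. The repeat count of 10 is an arbitrary choice for this sketch, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100000, 2),
                  columns=["column1", "column2"])

def myFunction(s):
    # Vectorized: operates on the whole Series at once.
    return s ** 2

runs = 10  # arbitrary repeat count for this sketch

# Average seconds per call for the vectorized version.
vectorized = timeit.timeit(lambda: myFunction(df["column1"]), number=runs) / runs

# Average seconds per call for the element-by-element version.
elementwise = timeit.timeit(
    lambda: df["column1"].apply(lambda x: x ** 2), number=runs
) / runs

print(f"vectorized:  {vectorized * 1e3:.2f} ms per call")
print(f"elementwise: {elementwise * 1e3:.2f} ms per call")
```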

In [4]: def myFunction(s):
    ...:     return s.apply(lambda x: x**2)
    ...:

In [5]: %%timeit
    ...: myFunction(df.column1)
    ...: 
10 loops, best of 3: 53.9 ms per loop

Basically, it's the same as doing a direct apply.

Upvotes: 2
