Reputation: 389
I want to apply a simple function to a column in a pandas DataFrame. I have done it in two different ways:
df['column1']=myFunction(df['column1'])
df['column1']=df['column1'].apply(lambda x:myFunction(x))
My dataset is not big enough for me to tell the difference, but I am guessing it has to do with speed.
Can anyone explain what the difference is and which one is preferred?
Upvotes: 4
Views: 66
Reputation: 11895
df['column1']=myFunction(df['column1'])
Here you are applying a function to a whole pd.Series. You're letting pandas handle how that's going to happen.
df['column1']=df['column1'].apply(lambda x:myFunction(x))
Here you are applying a function on each element.
In general, option 1 will be faster than option 2, but it depends heavily on what your actual myFunction does: whether it is vectorized or works element by element.
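To make the distinction concrete, here is a minimal sketch that records what the function actually receives in each case. The helper record_type and the three-row frame are hypothetical and exist only for illustration; they are not part of the question.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1.0, 2.0, 3.0]})

seen_types = []

def record_type(x):
    # Hypothetical helper: remember the type of whatever it was called with.
    seen_types.append(type(x).__name__)
    return x

# Option 1: the function is called once, with the whole Series.
record_type(df["column1"])
whole_call_types = list(seen_types)

# Option 2: apply calls the function separately with each scalar value.
seen_types.clear()
df["column1"].apply(record_type)
per_element_types = list(seen_types)

print(whole_call_types)
print(per_element_types)
```

With option 1 the function sees a single Series argument, so any speed-up depends on the function body using operations that work on the whole Series at once. With option 2 the function runs in a Python-level loop, one call per value.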
Case example:
Let's create a DataFrame with 2 columns and 100,000 rows (big enough to appreciate the difference in speed), and square the elements of column1:
In [1]:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(100000, 2),
                  columns=['column1', 'column2'])
def myFunction(s):
    return s**2
In [2]: %%timeit
...: myFunction(df.column1)
...:
1000 loops, best of 3: 1.68 ms per loop
In [3]: %%timeit
...: df.column1.apply(lambda x: x**2)
...:
10 loops, best of 3: 55.4 ms per loop
So here you see it's more than 30 times faster to do the operation on the pd.Series rather than element by element. That's because myFunction is vectorized.
Now, let's take an example where your myFunction is not vectorized but works element by element:
In [4]: def myFunction(s):
...:     return s.apply(lambda x: x**2)
...:
In [5]: %%timeit
...: myFunction(df.column1)
...:
10 loops, best of 3: 53.9 ms per loop
Basically, it's the same as doing a direct apply, because the function still loops over the elements internally.
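Some functions cannot take a Series directly at all, typically because they branch on a single value. The following is a minimal sketch, assuming a hypothetical function that replaces non-positive values with 0.0; the names clip_negative_scalar and clip_negative_vectorized are made up for illustration. It shows the scalar version working only through apply, and one possible vectorized rewrite using np.where.

```python
import numpy as np
import pandas as pd

s = pd.Series([-2.0, 0.5, 3.0])

def clip_negative_scalar(x):
    # Written for a single value: the `if` needs one True/False answer.
    return x if x > 0 else 0.0

def clip_negative_vectorized(values):
    # Same rule expressed on the whole array at once.
    return np.where(values > 0, values, 0.0)

# Calling the scalar version on the whole Series fails, because
# `s > 0` is a Series of booleans with no single truth value.
try:
    clip_negative_scalar(s)
    direct_call_failed = False
except ValueError:
    direct_call_failed = True

# Element by element through apply works, but loops in Python.
via_apply = s.apply(clip_negative_scalar)

# The vectorized rewrite handles the whole Series in one call.
via_where = pd.Series(clip_negative_vectorized(s), index=s.index)

print(direct_call_failed)
print(via_apply.tolist())
print(via_where.tolist())
```

When a vectorized rewrite like this is available, it is usually the better choice for large columns; apply remains a reasonable fallback when the logic is hard to express on whole arrays.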
Upvotes: 2