Reputation:
Why is applying a function over a DataFrame so much faster than over a Series?
import time
import pandas as pd
import numpy as np
my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.a.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 7.199899435043335
my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 0.09276103973388672
Upvotes: 4
Views: 374
Reputation: 152657
The reason for the time difference is that apply on a Series calls the function on every value in the Series. But for a DataFrame it calls the function just once for each column.
>>> my_range = np.arange(0, 10, 1, )
>>> df = pd.DataFrame(my_range, columns=["a"])
>>> _ = df.a.apply(lambda x: print(x, type(x)) or x ** 2)
0 <class 'int'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'int'>
6 <class 'int'>
7 <class 'int'>
8 <class 'int'>
9 <class 'int'>
>>> _ = df.apply(lambda x: print(x, type(x)) or x ** 2)
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
Name: a, dtype: int32 <class 'pandas.core.series.Series'>
[... repeated one more time ...]
I'll ignore the second call for the discussion here (according to DYZ, it's pandas' way of checking whether it can take the fast path).
So in your case you have 2 calls (DataFrame) vs. 1_000_000 calls (Series). That already explains most of the timing difference.
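The call counts can be checked directly instead of reading them off printed output. This is only a minimal sketch: the counter dictionary and the `square_counting` helper are made up for illustration, and the number of calls made by the DataFrame version depends on the pandas version (older releases call the function one extra time, as shown above).

```python
import numpy as np
import pandas as pd

# A small frame is enough to see the difference in call counts.
df = pd.DataFrame(np.arange(0, 1_000, 1), columns=["a"])

# Hypothetical helper: wraps x ** 2 and counts how often it is invoked.
calls = {"series": 0, "frame": 0}

def square_counting(key):
    def inner(x):
        calls[key] += 1
        return x ** 2
    return inner

# Series.apply: the function receives one scalar per row.
series_result = df["a"].apply(square_counting("series"))

# DataFrame.apply: the function receives a whole column (a Series).
frame_result = df.apply(square_counting("frame"))

print(calls)
```

On this frame the Series count equals the number of rows, while the DataFrame count stays at one or two regardless of how many rows there are.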
Given how differently they work, they are not really comparable at all. If you apply the function to the whole Series at once, the result is different again (much faster):
import pandas as pd
import numpy as np
my_range = np.arange(0, 1_000_000, 1, )
df = pd.DataFrame(my_range, columns=["a"])
%timeit df.a.apply(lambda x: x ** 2)
# 765 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit df.apply(lambda x: x ** 2)
# 63.2 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df.a ** 2 # apply function on the whole series directly
# 10.9 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
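%timeit only works in IPython/Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library timeit module. The `candidates` dictionary and the smaller array size are my own choices for illustration, and the absolute timings will differ by machine and pandas version, so no expected numbers are shown.

```python
import timeit

import numpy as np
import pandas as pd

# int64 is requested explicitly so the squares do not overflow on
# platforms where the default integer type is 32-bit.
df = pd.DataFrame(np.arange(0, 100_000, 1, dtype="int64"), columns=["a"])

# The three approaches from the answer, each returning the squared column.
candidates = {
    "Series.apply (per value)": lambda: df["a"].apply(lambda x: x ** 2),
    "DataFrame.apply (per column)": lambda: df.apply(lambda x: x ** 2)["a"],
    "vectorized (df.a ** 2)": lambda: df["a"] ** 2,
}

# Confirm that all approaches compute the same values before timing them.
results = {name: fn() for name, fn in candidates.items()}

# Average over a few runs; timeit.timeit returns the total time in seconds.
runs = 3
timings = {
    name: timeit.timeit(fn, number=runs) / runs
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.2f} ms")
```

All three produce the same values; only the number of Python-level function calls differs, and that is where the time goes.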
Upvotes: 2