user9098929

Why is applying a function (using "apply") over a DataFrame much faster than over a Series?

Why is applying a function over a DataFrame so much faster than over a Series?

import time
import pandas as pd
import numpy as np

my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.a.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 7.199899435043335


my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
my_time = time.time()
df["b"] = df.apply(lambda x: x ** 2)
print(time.time() - my_time)
# 0.09276103973388672

Upvotes: 4

Views: 374

Answers (1)

MSeifert

Reputation: 152657

The timing difference comes from how apply dispatches the function. On a Series, apply calls the function once for every value in the Series. On a DataFrame, it calls the function just once per column, passing the whole column in as a Series.

>>> my_range = np.arange(0, 10, 1)
>>> df = pd.DataFrame(my_range, columns=["a"])
>>> _ = df.a.apply(lambda x: print(x, type(x)) or x ** 2)
0 <class 'int'>
1 <class 'int'>
2 <class 'int'>
3 <class 'int'>
4 <class 'int'>
5 <class 'int'>
6 <class 'int'>
7 <class 'int'>
8 <class 'int'>
9 <class 'int'>

>>> _ = df.apply(lambda x: print(x, type(x)) or x ** 2)
0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
9    9
Name: a, dtype: int32 <class 'pandas.core.series.Series'>
[... repeated one more time ...]

I'll ignore the second call in this discussion (according to DYZ, it's pandas' way of checking whether it can take the fast path).

So in your case you have 2 calls (DataFrame) vs. 1_000_000 calls (Series). That already explains most of the timing difference.
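
If you want to verify the call counts yourself without printing a million lines, you can count the calls instead. This is only a minimal sketch: the `calls` dictionary and the `square_counting` helper are names I made up for illustration, and the DataFrame count may be 1 or 2 depending on your pandas version.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(0, 1_000, 1), columns=["a"])

# Hypothetical bookkeeping: one counter per kind of apply call.
calls = {"series": 0, "frame": 0}


def square_counting(key):
    """Return a squaring function that counts how often it is called."""
    def square(x):
        calls[key] += 1
        return x ** 2
    return square


series_result = df.a.apply(square_counting("series"))  # x is a single value
frame_result = df.apply(square_counting("frame"))      # x is a whole column

print(calls["series"])  # one call per element
print(calls["frame"])   # one call per column (plus possibly one extra call)
```

Both results contain the same squared values; only the number of Python-level function calls differs.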

Given how differently they work, the two are not really comparable at all. If you apply the operation to the whole Series at once, the result is different again (and faster):

import pandas as pd
import numpy as np

my_range = np.arange(0, 1_000_000, 1)
df = pd.DataFrame(my_range, columns=["a"])
%timeit df.a.apply(lambda x: x ** 2)
# 765 ms ± 4.28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df.apply(lambda x: x ** 2)
# 63.2 ms ± 1.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df.a ** 2  # apply function on the whole series directly
# 10.9 ms ± 481 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
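
The `%timeit` lines above only work in IPython/Jupyter. If you are running a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The `candidates` dictionary, the smaller array size, and the explicit `int64` dtype are my own choices for illustration, and the absolute numbers will vary by machine and pandas version, so I'm not showing any expected timings here.

```python
import timeit

import numpy as np
import pandas as pd

# Smaller than the original 1_000_000 rows so the script finishes quickly;
# int64 is an assumption to keep the squares from overflowing.
df = pd.DataFrame(np.arange(0, 100_000, 1, dtype=np.int64), columns=["a"])

# Hypothetical labels mapped to the three approaches compared above.
candidates = {
    "Series.apply (per element)": lambda: df.a.apply(lambda x: x ** 2),
    "DataFrame.apply (per column)": lambda: df.apply(lambda x: x ** 2),
    "vectorized (df.a ** 2)": lambda: df.a ** 2,
}

# Take the best of three single runs for each approach.
timings = {
    name: min(timeit.repeat(func, number=1, repeat=3))
    for name, func in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

Taking the minimum of a few repeats is a common way to reduce noise from other processes; it is not identical to the mean ± std. dev. that `%timeit` reports.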

Upvotes: 2
