thomas

Reputation: 349

Why are simple operations on pandas.DataFrames so slow compared to the same operations on numpy.ndarrays?

Why are operations on pandas.DataFrames so slow? Look at the following examples.

Measurement:

I measure the time of the same operations (a sum along each axis) in three cases:

  1. For the numpy.ndarray
  2. For the pandas.DataFrame
  3. For the pandas.DataFrame.values -> np.ndarray

Observations

Questions

  1. Why does this happen?
  2. How can this be optimized?
  3. Is pandas not able to call or pass through NumPy's operations?

import numpy as np
import pandas as pd

n = 50000
m = 5000
array = np.random.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)

Numpy

%%timeit
array.sum(axis=0)
206 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
array.sum(axis=1)
233 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Pandas

%%timeit
dataframe.sum(axis=0)
1.65 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.sum(axis=1)
1.74 s ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Pandas without Pandas

Let's operate on the values alone ...

%%timeit
dataframe.values.sum(axis=0)
206 ms ± 7.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.values.sum(axis=1)
181 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Upvotes: 6

Views: 6410

Answers (1)

Serge Ballesta

Reputation: 149135

Pandas uses numpy arrays as its underlying data containers, but provides many more features. A DataFrame contains a collection of 1D numpy arrays, possibly of different dtypes, along with two Index objects (one for the rows and one for the columns). Those indexes can even be MultiIndex objects.

All this comes at a performance cost.
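To make that extra structure concrete, here is a minimal sketch that builds a small frame and inspects it. The column names, values, and index labels are made up for illustration only.

```python
import pandas as pd

# Hypothetical frame: three columns, each with its own dtype,
# and a two-level MultiIndex on the rows.
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]},
    index=pd.MultiIndex.from_tuples([("g1", 0), ("g1", 1), ("g2", 0)]),
)

print(df.dtypes)                   # one dtype per column
print(type(df.index).__name__)     # row labels
print(type(df.columns).__name__)   # column labels
```

A plain ndarray carries none of this: it has a single dtype and no labels, so a reduction over it has less bookkeeping to do.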

The good news is that you can directly process the underlying numpy arrays at numpy level for additional performance if you do not need the fancy indexing of pandas.
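As a rough sketch of that approach, assuming a frame where every column has the same numeric dtype (as in the question), you can pull out the array, reduce it with numpy, and reattach the labels only if you need them. The frame here is much smaller than the one in the question so that it runs quickly.

```python
import numpy as np
import pandas as pd

# Small stand-in for the question's 50000 x 5000 frame.
rng = np.random.default_rng(0)
dataframe = pd.DataFrame(rng.uniform(0, 1, (1000, 50)))

# Get the underlying ndarray and reduce it with numpy.
values = dataframe.to_numpy()
col_sums = values.sum(axis=0)

# Optional: put the column labels back on the result.
col_sums_labeled = pd.Series(col_sums, index=dataframe.columns)

# Same numbers as the pandas reduction (no NaN in this data).
assert np.allclose(col_sums_labeled.to_numpy(), dataframe.sum(axis=0).to_numpy())
```

One difference to keep in mind: dataframe.sum skips NaN by default, while ndarray.sum propagates it, so np.nansum is the closer equivalent if the data can contain missing values.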

Upvotes: 3
