Reputation: 349
Why are operations on pandas.DataFrames so slow? Look at the following examples.

I create a numpy.ndarray populated with random floating point numbers and a pandas.DataFrame populated with the same numpy array. Then I measure the time of summing along each axis for the numpy.ndarray, for the pandas.DataFrame, and for pandas.DataFrame.values (-> np.ndarray).

Operating on numpy.ndarrays is much faster than operating on pandas.DataFrames. The pd.DataFrame contains only floating point numbers and has nothing special attached (MultiIndex or whatever).

Why are operations on numpy.ndarray about 7 to 10 times faster? Why is pandas not able to call or pass through numpy's operations?

import numpy as np
import pandas as pd
n = 50000
m = 5000
array = np.random.uniform(0, 1, (n, m))
dataframe = pd.DataFrame(array)
%%timeit
array.sum(axis=0)
206 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
array.sum(axis=1)
233 ms ± 33.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
dataframe.sum(axis=0)
1.65 s ± 14.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.sum(axis=1)
1.74 s ± 15.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Let's operate on the values alone ...
%%timeit
dataframe.values.sum(axis=0)
206 ms ± 7.13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
dataframe.values.sum(axis=1)
181 ms ± 1.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Upvotes: 6
Views: 6410
Reputation: 149135
Pandas uses numpy arrays as its underlying data containers, but provides many more features. A DataFrame holds a collection of columns backed by 1D numpy arrays of possibly different dtypes, along with two Index objects (one for the rows and one for the columns). Those indexes can even be MultiIndex objects.
All this comes at a performance cost.
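To make that structure concrete, here is a minimal sketch that inspects the pieces a DataFrame carries around. The column names, row labels and values are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd

# A small frame with columns of different dtypes (hypothetical example data).
df = pd.DataFrame(
    {"a": [1, 2, 3], "b": [0.5, 1.5, 2.5], "c": ["x", "y", "z"]},
    index=["r1", "r2", "r3"],
)

# Each column carries its own dtype.
column_dtypes = df.dtypes

# The two Index objects: one for the rows, one for the columns.
row_index = df.index
column_index = df.columns

# A single numeric column converts to a 1D numpy array.
b_values = df["b"].to_numpy()

print(column_dtypes)
print(row_index, column_index)
print(type(b_values), b_values.ndim)
```

Every DataFrame operation has to account for these labels and per-column dtypes, which a plain numpy.ndarray does not have.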
The good news is that you can process the underlying numpy arrays directly at the numpy level for additional performance if you do not need the fancy indexing of pandas.
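A minimal sketch of that approach, assuming a frame with a single float dtype as in the question. The array here is smaller than the one in the question so it runs quickly, and the variable names are only illustrative. `to_numpy()` is used here; `.values` from the question works as well.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dataframe = pd.DataFrame(rng.uniform(0, 1, (1000, 50)))

# Drop to the underlying numpy array. For a single-dtype frame this is
# typically cheap; mixed dtypes would force a conversion to a common dtype.
values = dataframe.to_numpy()

# Sum at the numpy level, then reattach the labels only where they are needed.
column_sums = pd.Series(values.sum(axis=0), index=dataframe.columns)
row_sums = pd.Series(values.sum(axis=1), index=dataframe.index)
```

The results match `dataframe.sum(axis=0)` and `dataframe.sum(axis=1)` up to floating point rounding, so the labels can be restored at the end without paying for them during the computation.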
Upvotes: 3