Reputation: 83
I'm trying to take mins, maxes, means, etc. of the columns of my Pandas df (all numeric values of some kind), and the Pandas methods don't seem to be the fastest way. If I first convert a column with .values, the runtime of these operations improves dramatically. Is this intended behavior, or is Pandas doing something silly? Perhaps calling .values uses extra memory, or I'm making assumptions and/or making the work easier in some way that isn't a given?
"Evidence" of the unexpected behavior:
import time
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,1000,size=(100000000, 4)), columns=list('ABCD'))

start = time.time()
print(df['A'].min())
print(time.time()-start)
# 0
# 1.35876178741
start = time.time()
print(df['A'].values.min())
print(time.time()-start)
# 0
# 0.225932121277
start = time.time()
print(np.mean(df['A']))
print(time.time()-start)
# 499.49969672
# 1.58990907669
start = time.time()
print(df['A'].values.mean())
print(time.time()-start)
# 499.49969672
# 0.244406938553
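For somewhat more reliable wall-clock numbers, time.perf_counter is intended for interval timing in a way time.time is not; a minimal sketch of the same check, reusing the df above:
start = time.perf_counter()
print(df['A'].min())
print(time.perf_counter()-start)
start = time.perf_counter()
print(df['A'].values.min())
print(time.perf_counter()-start)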
Upvotes: 3
Views: 830
Reputation: 5955
When you select a single column, you get a pandas Series, which is built on a NumPy array but with a lot more machinery wrapped around it. Pandas objects are optimized for spreadsheet- and database-style operations like joins, lookups, and so on.
Calling .values on a column hands you the underlying NumPy ndarray, a type optimized for mathematical and vector operations implemented in C. Even with the 'unwrapping' to the ndarray type, the raw mathematical operations beat the Series methods hands-down.
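On the memory worry from the question: for a single numeric column, .values typically returns a view of the Series' existing buffer rather than a copy, so it shouldn't cost extra memory. A minimal sketch to check this (the exact behavior can vary with dtype and with pandas' copy-on-write settings):
import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, size=1_000_000))
a = s.values
# For a single numeric dtype, the ndarray returned by .values is
# normally backed by the same buffer as the Series itself:
print(np.shares_memory(a, s.values))  # expected: True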
As a side note, there is a dedicated module, timeit (and IPython's %timeit magic), for exactly this kind of timing comparison:
type(df['a'])
# pandas.core.series.Series

%timeit df['a'].min()
# 6.68 ms ± 121 µs per loop

type(df['a'].values)
# numpy.ndarray

%timeit df['a'].values.min()
# 696 µs ± 18 µs per loop
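%timeit is an IPython magic; outside IPython, the standard-library timeit gives the same kind of measurement. A rough equivalent, reusing the df above:
import timeit

# Run each statement 100 times and report the total time in seconds.
print(timeit.timeit(lambda: df['a'].min(), number=100))
print(timeit.timeit(lambda: df['a'].values.min(), number=100))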
Upvotes: 3