JRalston

Reputation: 83

Pandas .min() method doesn't seem fastest

I'm trying to take mins, maxes, means, etc. of columns of my Pandas df (all numeric values of some kind), and the Pandas methods don't seem to be the fastest option. If I first call .values, the runtime of these operations improves greatly. Is this intended behavior? In other words, is Pandas doing something silly, or is this by design? Perhaps calling .values uses extra memory, or I'm implicitly making the operation easier in some way that isn't a given.

"Evidence" of the unexpected behavior:

import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(100000000, 4)), columns=list('ABCD'))

start = time.time()
print(df['A'].min())
print(time.time()-start)

# 0
# 1.35876178741


start = time.time()
print(df['A'].values.min())
print(time.time()-start)

# 0
# 0.225932121277

start = time.time()
print(np.mean(df['A']))
print(time.time()-start)

# 499.49969672
# 1.58990907669

start = time.time()
print(df['A'].values.mean())
print(time.time()-start)

# 499.49969672
# 0.244406938553

Upvotes: 3

Views: 830

Answers (1)

G. Anderson

Reputation: 5955

When you just call a column, you get back a pandas Series, which is built on a numpy array but with a lot more wrapped around it. Pandas objects are optimized for spreadsheet- or database-style operations like joins, lookups, etc.

When you call .values on a column, you get the underlying numpy array, a data structure optimized for mathematical and vector operations implemented in C. Even with the 'unwrapping' to the ndarray type, the mathematical operations beat the Series hands-down. Here is a quick discussion on some of the differences.
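As a minimal sketch of that wrapper relationship (assuming only pandas and NumPy are installed, and using a much smaller frame than the question's for speed): a Series carries an ndarray plus index and metadata, .values exposes the ndarray, and both routes compute the same result.

```python
import numpy as np
import pandas as pd

# A small frame for illustration; the question used 1e8 rows
df = pd.DataFrame(np.random.randint(0, 1000, size=(1000, 4)),
                  columns=list('ABCD'))

s = df['A']        # a pandas Series: an ndarray plus index and metadata
arr = s.values     # the underlying numpy ndarray

print(type(s))     # <class 'pandas.core.series.Series'>
print(type(arr))   # <class 'numpy.ndarray'>

# Both routes compute the same answer; only the dispatch overhead differs
print(s.min() == arr.min())  # True
```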

As a side note, there is a module built specifically for these kinds of timing comparisons: timeit.

type(df['a'])

pandas.core.series.Series

%timeit df['a'].min()

6.68 ms ± 121 µs per loop

type(df['a'].values)

numpy.ndarray

%timeit df['a'].values.min()

696 µs ± 18 µs per loop
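The %timeit magic above is IPython-specific; outside a notebook, the standard-library timeit module gives the same comparison. A sketch (the frame size and repeat count here are arbitrary choices, not from the original posts):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 1000, size=(1_000_000, 1)),
                  columns=['a'])

# timeit.timeit runs the callable `number` times and returns total seconds
t_series = timeit.timeit(lambda: df['a'].min(), number=20)
t_ndarray = timeit.timeit(lambda: df['a'].values.min(), number=20)

print(f"Series.min():  {t_series:.4f} s over 20 calls")
print(f"ndarray.min(): {t_ndarray:.4f} s over 20 calls")
```

Exact numbers will vary by machine, but the ndarray path typically comes out well ahead, consistent with the %timeit results above.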

Upvotes: 3
