Lion_chocolatebar

Reputation: 95

Numpy faster at sorting than Pandas

This is a basic question about sorting arrays in numpy and pandas:

I noticed that sorting a data frame and selecting specific columns with pandas took almost twice as long as doing the same thing after I changed the code to use numpy arrays.

What is the reason for this change in speed?

Thanks, Leon

E.g. pandas:

j = pd.DataFrame(df)         # df columns["date","I",...]
j = j.sort_values("date", ascending=False)  # DataFrame.sort was removed in pandas 0.20
x = [[DATES[int(k[1]) - 1]] for k in j["date"].tolist()]
y = j["I"].tolist()

E.g. numpy:

j = np.array(df)             # df column["date"] == j[:,0]
j = np.array(sorted(j, key=lambda a_entry: a_entry[0]))
x = [[DATES[int(k[1]) - 1]] for k in j[:,0].tolist()]
y = j[:,4].tolist()          # df column["I"] == j[:,4] 

Upvotes: 2

Views: 2898

Answers (1)

ljc

Reputation: 938

https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ explains it quite nicely. pandas has a lot of overhead compared to numpy.

Quote from that site: "Why is Pandas so much slower than NumPy? The short answer is that Pandas is doing a lot of stuff when you index into a Series, and it’s doing that stuff in Python."
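If you want to see the overhead for yourself, here is a rough sketch rather than a proper benchmark. The data is made up (a numeric "date" key and an "I" column standing in for the columns in the question), the helper names are mine, and the numbers will vary by machine and by pandas/numpy version. It times a sort-then-select in both libraries and then an element-by-element loop over a Series versus the underlying array, which is the per-index cost the quote is talking about.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in data: unique numeric "date" keys and an "I" column.
rng = np.random.default_rng(0)
n = 10_000
df = pd.DataFrame({"date": rng.permutation(n), "I": rng.random(n)})
arr = df.to_numpy()  # column 0 is "date", column 1 is "I"


def pandas_sorted_I(frame):
    """Sort by "date" (descending) in pandas and return the "I" column."""
    return frame.sort_values("date", ascending=False)["I"].tolist()


def numpy_sorted_I(a):
    """Same result using np.argsort on the raw array."""
    order = np.argsort(a[:, 0])[::-1]  # descending order of the date column
    return a[order, 1].tolist()


series = df["I"]
values = series.to_numpy()


def sum_series(s):
    """Index into the Series one element at a time."""
    total = 0.0
    for i in range(len(s)):
        total += s.iloc[i]
    return total


def sum_array(v):
    """Index into the plain ndarray one element at a time."""
    total = 0.0
    for i in range(len(v)):
        total += v[i]
    return total


if __name__ == "__main__":
    # Timings are only indicative; no particular ratio is guaranteed.
    print("pandas sort :", timeit.timeit(lambda: pandas_sorted_I(df), number=5))
    print("numpy sort  :", timeit.timeit(lambda: numpy_sorted_I(arr), number=5))
    print("Series index:", timeit.timeit(lambda: sum_series(series), number=1))
    print("array index :", timeit.timeit(lambda: sum_array(values), number=1))
```

Note that the numpy snippet in the question sorts with Python's built-in `sorted` and a lambda; `np.argsort` as used above keeps the sort inside numpy, which is usually the faster option.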

Upvotes: 1
