Numpy faster at sorting than Pandas

Question

This is a basic question about sorting arrays in numpy and pandas:

I noticed that sorting a data frame and selecting specific columns with pandas took almost twice as long as the same steps did after I changed the code to use numpy arrays.

What is the reason for this change in speed?

Thanks, Leon

E.g. pandas:

j = pd.DataFrame(df)         # df columns["date","I",...]
j = j.sort_values("date", ascending=False)   # DataFrame.sort() was removed in pandas 0.20
x = [[DATES[int(k[1]) - 1]] for k in j["date"].tolist()]
y = j["I"].tolist()

E.g. numpy:

j = np.array(df)             # df column["date"] == j[:,0]
j = np.array(sorted(j, key=lambda a_entry: a_entry[0]))
x = [[DATES[int(k[1]) - 1]] for k in j[:,0].tolist()]
y = j[:,4].tolist()          # df column["I"] == j[:,4]

ljc · Accepted Answer

https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ explains it quite nicely. pandas has a lot of overhead compared to numpy.

Quote from that site: "Why is Pandas so much slower than NumPy? The short answer is that Pandas is doing a lot of stuff when you index into a Series, and it’s doing that stuff in Python."
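
If you want to see that per-lookup overhead for yourself, here is a minimal sketch of how one could measure it. The random data, the helper name sum_by_index and the repeat count are my own assumptions for illustration, not something taken from the linked article, and the absolute timings will vary with your machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 10,000 random floats, held once as a Series and once as an array.
rng = np.random.default_rng(0)
values = rng.random(10_000)
series = pd.Series(values)
array = series.to_numpy()


def sum_by_index(container, n):
    """Add up the first n elements one at a time, forcing n scalar lookups."""
    total = 0.0
    for i in range(n):
        total += container[i]
    return total


n = len(array)

# Time the same loop against both containers; only the lookup cost differs.
pandas_seconds = timeit.timeit(lambda: sum_by_index(series, n), number=5)
numpy_seconds = timeit.timeit(lambda: sum_by_index(array, n), number=5)

print(f"pandas Series lookups: {pandas_seconds:.4f} s")
print(f"numpy array lookups:   {numpy_seconds:.4f} s")
```

Both loops do exactly the same arithmetic, so any gap between the two timings comes from how each container handles a single-element lookup. This is only a sketch of the indexing cost the quote describes; it does not time the sort itself.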
