Reputation: 95
This is a basic question about sorting arrays in numpy and pandas:
I noticed that when I used pandas for sorting and selecting specific columns of a data frame, it took almost twice as long when I changed the code to use numpy arrays.
What is the reason for this change in speed?
Thanks, Leon
E.g. pandas:
j = pd.DataFrame(df) # df columns["date","I",...]
j = j.sort_values(["date"], ascending=False) # sort_values replaces the older DataFrame.sort
x = [[DATES[int(k[1]) - 1]] for k in j["date"].tolist()]
y = j["I"].tolist()
E.g. numpy:
j = np.array(df) # df column["date"] == j[:,0]
j = np.array(sorted(j, key=lambda a_entry: a_entry[0]))
x = [[DATES[int(k[1]) - 1]] for k in j[:,0].tolist()]
y = j[:,4].tolist() # df column["I"] == j[:,4]
Upvotes: 2
Views: 2898
Reputation: 938
https://penandpants.com/2014/09/05/performance-of-pandas-series-vs-numpy-arrays/ explains it quite nicely: pandas has a lot of overhead compared to numpy.
Quote from that site: "Why is Pandas so much slower than NumPy? The short answer is that Pandas is doing a lot of stuff when you index into a Series, and it’s doing that stuff in Python."
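If you want to see that overhead on your own machine, here is a rough sketch that times a single element lookup on a numpy array against the same lookup on a pandas Series. The data, the names (values, series, time_per_call) and the repeat count are made up for illustration and are not taken from the question or the linked post. Actual timings depend on your pandas and numpy versions, so treat the printed numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 random floats, held as an array and as a Series.
values = np.random.default_rng(0).random(100_000)
series = pd.Series(values)


def time_per_call(func, number=10_000):
    """Return the average time in seconds of one call to func."""
    return timeit.timeit(func, number=number) / number


# Single-element lookup on the raw array.
t_array = time_per_call(lambda: values[500])
# The same lookup through the Series, which runs extra Python-level code.
t_series = time_per_call(lambda: series.iloc[500])
# Pulling the underlying array out once (outside the timed call) and
# indexing that skips the per-lookup Series overhead.
unwrapped = series.to_numpy()
t_unwrapped = time_per_call(lambda: unwrapped[500])

print(f"numpy array:       {t_array * 1e9:.0f} ns per lookup")
print(f"pandas Series:     {t_series * 1e9:.0f} ns per lookup")
print(f"Series.to_numpy(): {t_unwrapped * 1e9:.0f} ns per lookup")
```

Note that this only measures element access. It says nothing about sorting speed, and the two snippets in the question also sort in different ways (a pandas sort versus Python's sorted() with a lambda key), so they are not a like-for-like comparison.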
Upvotes: 1