Reputation: 139
Someone told me that if you are looping in Python you are doing something wrong, and I tend to agree, so I did some performance analysis on my program and I'm surprised by the results:
I'm trying to retrieve the index labels of the non-NaN data in a pandas Series with dropna(), and it seems to be slower than looping:
from pandas import Series
import numpy as np
import timeit

def test1():
    s = Series([25.9,25.8,np.nan,34.8],index=['a','b','c','d'])
    return s.dropna().index

def test2():
    s = Series([25.9,25.8,np.nan,34.8],index=['a','b','c','d'])
    res = []
    for i in s.index:
        if not np.isnan(s[i]):
            res.append(i)
    return res
>>> timeit.timeit(test1,number=10000)
1.931797840017623
>>> timeit.timeit(test2,number=10000)
1.602180508842423
Am I missing something here? Or is it just because I'm returning a list instead of a pandas Index?
Thanks in advance
Upvotes: 2
Views: 334
Reputation: 375377
These are very small Series. Try with a larger one:
In [11]: s = pd.Series([25.9,25.8,np.nan,34.8] * 1000)
In [12]: %timeit [i for i in s.index if not np.isnan(s[i])]
10 loops, best of 3: 34.9 ms per loop
In [13]: %timeit s.dropna().index
10000 loops, best of 3: 106 µs per loop
Note: I've used a list comprehension, which may be slightly faster than your implementation.
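To see the effect of size for yourself, here is a rough sketch that times both approaches on a 4-element and a 4000-element Series. Unlike test1 and test2 in the question, the Series is built once outside the timed call, so only the index retrieval is measured. The function names are made up for illustration, and the notna() mask variant is an extra alternative of mine, not something from the question. Absolute timings will depend on your machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def non_nan_labels_loop(s):
    # Python-level loop: one label lookup and one np.isnan call per element.
    return [i for i in s.index if not np.isnan(s[i])]


def non_nan_labels_dropna(s):
    # Vectorized: dropna() filters out NaN values, then we take the index.
    return s.dropna().index


def non_nan_labels_mask(s):
    # Alternative (my addition): select labels with a boolean mask,
    # without building the intermediate filtered Series.
    return s.index[s.notna().to_numpy()]


def seconds_per_call(func, s, number=10):
    # The Series is created by the caller, so construction is not timed.
    return timeit.timeit(lambda: func(s), number=number) / number


if __name__ == "__main__":
    pattern = [25.9, 25.8, np.nan, 34.8]
    for repeats in (1, 1000):
        s = pd.Series(pattern * repeats)
        print(f"{len(s)} elements")
        for func in (non_nan_labels_loop, non_nan_labels_dropna, non_nan_labels_mask):
            t = seconds_per_call(func, s)
            print(f"  {func.__name__}: {t * 1e6:.1f} µs per call")
```

On the tiny Series the fixed per-call overhead of the pandas methods dominates, so the loop can look competitive; on the larger one the per-element cost of the loop takes over.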
Upvotes: 4