deubNippon

Reputation: 139

Python Pandas performance

Someone told me that if you are looping in Python you are doing something wrong, and I tend to agree, so I ran some performance analysis on my program and was surprised by the results:

I'm trying to retrieve the index labels of the non-NaN data in a pandas Series with dropna(), and it seems to be slower than looping:

from pandas import Series
import numpy as np
import timeit

def test1():
    s = Series([25.9,25.8,np.nan,34.8],index=['a','b','c','d'])
    return s.dropna().index

def test2():
    s = Series([25.9,25.8,np.nan,34.8],index=['a','b','c','d'])
    res = []
    for i in s.index:
        if not np.isnan(s[i]):
            res.append(i)
    return res


>>> timeit.timeit(test1,number=10000)
1.931797840017623
>>> timeit.timeit(test2,number=10000)
1.602180508842423

Am I missing something here? Or is it just because I'm returning a list instead of a pandas Index?

Thanks in advance

Upvotes: 2

Views: 334

Answers (1)

Andy Hayden

Reputation: 375377

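These are very small Series, so the timings are dominated by fixed per-call overhead (including constructing the Series inside each test function) rather than by the filtering itself. Try with a larger one: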

In [11]: s = pd.Series([25.9,25.8,np.nan,34.8] * 1000)

In [12]: %timeit [i for i in s.index if not np.isnan(s[i])]
10 loops, best of 3: 34.9 ms per loop

In [13]: %timeit s.dropna().index
10000 loops, best of 3: 106 µs per loop

Note: I've used a list comprehension, which may be slightly faster than your explicit loop with append.
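
To check how the gap changes with size on your own machine, a minimal sketch along these lines could be used. It builds the Series outside the timed code, so construction cost is excluded, and times both approaches at 4 and 4000 elements. The helper names labels_loop, labels_dropna and bench are hypothetical, introduced only for this illustration, and the actual timings will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def labels_loop(s):
    # Python-level loop: one label lookup and one isnan call per element.
    return [i for i in s.index if not np.isnan(s[i])]


def labels_dropna(s):
    # Vectorized: pandas drops the NaN entries in one call.
    return s.dropna().index


def bench(fn, s, number):
    # Average seconds per call over `number` runs (hypothetical helper).
    return timeit.timeit(lambda: fn(s), number=number) / number


for repeats in (1, 1000):
    # The Series is created here, outside the timed functions.
    s = pd.Series([25.9, 25.8, np.nan, 34.8] * repeats)
    # Both approaches should select the same labels.
    assert labels_loop(s) == list(labels_dropna(s))
    t_loop = bench(labels_loop, s, number=20)
    t_dropna = bench(labels_dropna, s, number=20)
    print(f"n={len(s)}: loop {t_loop * 1e6:.1f} us, dropna {t_dropna * 1e6:.1f} us")
```

The loop's cost should grow roughly in proportion to the number of elements, while dropna() pays a mostly fixed overhead per call, which is why it can look slower on a four-element Series.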

Upvotes: 4
