ewall

Reputation: 28100

Any way to speed up this pandas comparison?

I have a Python script which is slurping up some odd log files and putting them into a pandas.DataFrame so I can do some stat analysis. Since the logs are a snapshot of processes at 5 minute intervals, when I read each file I am checking the new lines against the data entered from the last file to see if they are the same process from before (in which case I just update the time on the existing record). It works okay, but can be surprisingly slow when the individual logs get over 100,000 lines.
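Roughly, the read loop looks something like this (a simplified sketch, not the real code; parse_line and carried_over_ids are illustrative names):

for line in new_log_lines:
    s = parse_line(line)                        # one pd.Series per log line
    match = carryover(s, df, carried_over_ids)  # indices of rows from the last file
    if match is not None:
        df.loc[match, 'time'] = s['time']       # same process: just update the time
    else:
        df = df.append(s, ignore_index=True)    # new process: add a record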

When I profile the performance, there are a few stand-outs, but a lot of time is spent in this simple function, which basically compares a series against the rows carried over from the previous log:

def carryover(s, df, ids):
    # Return the index of the first row in df (among the given indices, ids)
    # whose columns a-f all match the pd.Series s; otherwise return None.
    for idx in ids:
        r = df.iloc[idx]
        if (r['a'] == s['a'] and
            r['b'] == s['b'] and
            r['c'] == s['c'] and
            r['d'] == s['d'] and
            r['e'] == s['e'] and
            r['f'] == s['f']):
            return idx
    return None

I figured this was pretty efficient, since the and's short-circuit and all... but is there maybe a better way?

Otherwise, are there other things I can do to help this run faster? The resulting DataFrame should fit in RAM just fine, but I don't know if there are things I should be setting to ensure caching, etc. are optimal. Thanks, all!

Upvotes: 2

Views: 171

Answers (1)

Andy Hayden

Reputation: 375465

It's quite slow to iterate and look up rows like this, even though the comparisons short-circuit: the speed depends on how early a match for s is found, and in the worst case every row in ids gets checked.

A more "numpy" way would be to do this calculation on the entire array:

cols = ['a', 'b', 'c', 'd', 'e', 'f']
equals_s = df.loc[ids, cols] == s[cols]
row_equals_s = equals_s.all(axis=1)

Then the first index for which this is True is given by idxmax (True counts as the maximum, and idxmax returns its first occurrence):

row_equals_s.idxmax()
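One caveat: if no row matches, idxmax just returns the first label (the max of an all-False series is False), so it's safest to check any() first:

match = row_equals_s.idxmax() if row_equals_s.any() else None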

If speed is crucial, and short-circuiting is important, then it could be worth rewriting your function in Cython, where you can iterate quickly over NumPy arrays.
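Something along these lines could work (an untested sketch, assuming the six columns have been extracted as a contiguous 2-D float64 array and ids as an int64 array; carryover_cy is an illustrative name):

# carryover.pyx -- compile with Cython
import numpy as np
cimport numpy as np

def carryover_cy(np.ndarray[np.float64_t, ndim=2] arr,
                 np.ndarray[np.float64_t, ndim=1] s_vals,
                 np.ndarray[np.int64_t, ndim=1] ids):
    # arr:    the six columns of df, e.g. df[cols].values
    # s_vals: the corresponding six values from s, e.g. s[cols].values
    cdef Py_ssize_t i, j
    for i in range(ids.shape[0]):
        for j in range(s_vals.shape[0]):
            if arr[ids[i], j] != s_vals[j]:
                break            # mismatch: short-circuit to the next row
        else:
            return ids[i]        # every column matched
    return None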

Upvotes: 2
