ewall

Reputation: 28100

Any way to speed up this pandas comparison?

I have a Python script which is slurping up some odd log files and putting them into a pandas.DataFrame so I can do some stat analysis. Since the logs are a snapshot of processes at 5 minute intervals, when I read each file I am checking the new lines against the data entered from the last file to see if they are the same process from before (in which case I just update the time on the existing record). It works okay, but can be surprisingly slow when the individual logs get over 100,000 lines.
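Roughly, the read loop looks something like this (a simplified sketch, not the real code; parse_line and carried_over_ids are illustrative names):

for line in new_log_lines:
    s = parse_line(line)                        # one pd.Series per log line
    match = carryover(s, df, carried_over_ids)  # indices of rows from the last file
    if match is not None:
        df.loc[match, 'time'] = s['time']       # same process: just update the time
    else:
        df = df.append(s, ignore_index=True)    # new process: add a record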

When I profile the performance, there are a few stand-outs, but a lot of time is spent in this simple function, which basically compares a series against the rows carried over from the previous log:

def carryover(s, df, ids):
    # Return the index of the first row in df (among the given indices, ids)
    # whose columns a-f all match the pd.Series s; otherwise return None.
    for idx in ids:
        r = df.iloc[idx]
        if (r['a'] == s['a'] and
            r['b'] == s['b'] and
            r['c'] == s['c'] and
            r['d'] == s['d'] and
            r['e'] == s['e'] and
            r['f'] == s['f']):
            return idx
    return None

I figured this was pretty efficient, since the and's short-circuit and all... but is there maybe a better way?

Otherwise, are there other things I can do to help this run faster? The resulting DataFrame should fit in RAM just fine, but I don't know if there are things I should be setting to ensure caching, etc. are optimal. Thanks, all!

Upvotes: 2

Views: 171

Answers (1)

Andy Hayden

Reputation: 375465

It's quite slow to iterate and look up rows like this, even though the comparisons short-circuit: the speed depends on how early a match for s is found, and in the worst case every row in ids gets checked.

A more "numpy" way would be to do this calculation on the entire array:

cols = ['a', 'b', 'c', 'd', 'e', 'f']
equals_s = df.loc[ids, cols] == s[cols]
row_equals_s = equals_s.all(axis=1)

Then the first index for which this is True is given by idxmax (True counts as the maximum, and idxmax returns its first occurrence):

row_equals_s.idxmax()
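One caveat: if no row matches, idxmax just returns the first label (the max of an all-False series is False), so it's safest to check any() first:

match = row_equals_s.idxmax() if row_equals_s.any() else None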

If speed is crucial, and short-circuiting is important, then it could be worth rewriting your function in Cython, where you can iterate quickly over NumPy arrays.
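Something along these lines could work (an untested sketch, assuming the six columns have been extracted as a contiguous 2-D float64 array and ids as an int64 array; carryover_cy is an illustrative name):

# carryover.pyx -- compile with Cython
import numpy as np
cimport numpy as np

def carryover_cy(np.ndarray[np.float64_t, ndim=2] arr,
                 np.ndarray[np.float64_t, ndim=1] s_vals,
                 np.ndarray[np.int64_t, ndim=1] ids):
    # arr:    the six columns of df, e.g. df[cols].values
    # s_vals: the corresponding six values from s, e.g. s[cols].values
    cdef Py_ssize_t i, j
    for i in range(ids.shape[0]):
        for j in range(s_vals.shape[0]):
            if arr[ids[i], j] != s_vals[j]:
                break            # mismatch: short-circuit to the next row
        else:
            return ids[i]        # every column matched
    return None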

Upvotes: 2
