Reputation: 28100
I have a Python script which is slurping up some odd log files and putting them into a pandas.DataFrame so I can do some stat analysis. Since the logs are a snapshot of processes at 5 minute intervals, when I read each file I am checking the new lines against the data entered from the last file to see if they are the same process from before (in which case I just update the time on the existing record). It works okay, but can be surprisingly slow when the individual logs get over 100,000 lines.
When I profile the performance, there are a few stand-outs, but it does show a lot of time spent in this simple function, which basically compares a series against the rows carried over from the previous log:
def carryover(s, df, ids):
    # see if pd.Series (s) matches any rows in pd.DataFrame (df) at the given indices (ids)
    for id in ids:
        r = df.iloc[id]
        if (r['a'] == s['a'] and
            r['b'] == s['b'] and
            r['c'] == s['c'] and
            r['d'] == s['d'] and
            r['e'] == s['e'] and
            r['f'] == s['f']):
            return id
    return None
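For context, it gets called roughly like this for each line of the new log (simplified; new_rows, ids, and the 'last_seen' column are placeholders for what my script actually builds):

import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f']

# df: records accumulated so far; ids: positions of the rows carried over
# from the previous log file; new_rows: the parsed lines of the current log
for row in new_rows:
    s = pd.Series(row)                               # keys 'a'..'f' plus 'last_seen'
    match = carryover(s, df, ids)
    if match is not None:
        # same process as before: just bump the time on the existing record
        df.iloc[match, df.columns.get_loc('last_seen')] = s['last_seen']
    else:
        # new process: append it as a fresh row
        df = pd.concat([df, s.to_frame().T], ignore_index=True)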
I'd figure the comparison function is pretty efficient, since the and's are short-circuiting and all... but is there maybe a better way?
Otherwise, are there other things I can do to help this run faster? The resulting DataFrame should fit in RAM just fine, but I don't know if there are things I should be setting to ensure caching, etc. are optimal. Thanks, all!
Upvotes: 2
Views: 171
Reputation: 375465
It's quite slow to iterate and look up rows like this (even though it will short-circuit); most likely the speed depends on how early a match for s is found...
A more "numpy" way would be to do this calculation on the entire array:
equals_s = df.loc[ids, ['a', 'b', 'c', 'd', 'e', 'f']] == s.loc[['a', 'b', 'c', 'd', 'e', 'f']]
row_equals_s = equals_s.all(axis=1)
Then the first index for which this is True is the idxmax:
row_equals_s.idxmax()
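Putting it together, a drop-in replacement might look like this (just a sketch: idxmax on an all-False Series still returns the first label, so you need an explicit any() check for the no-match case; and if your ids are positions rather than index labels, use .iloc instead of .loc):

def carryover_vectorized(s, df, ids):
    cols = ['a', 'b', 'c', 'd', 'e', 'f']
    # compare all candidate rows against s in one shot
    equals_s = df.loc[ids, cols] == s[cols]
    row_equals_s = equals_s.all(axis=1)
    if not row_equals_s.any():
        return None                     # nothing carried over matches
    return row_equals_s.idxmax()        # first matching index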
If speed is crucial, and short-circuiting is important, then it could be worth rewriting your function in Cython, where you can iterate quickly over NumPy arrays.
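For example, a minimal Cython sketch of that idea, assuming the six columns can be packed into a single float64 array (if some of them are strings you'd have to encode them numerically first, or fall back to object arrays):

# carryover_fast.pyx -- illustrative names, compile with cythonize
def carryover_fast(double[:] s, double[:, :] rows):
    # rows: the carried-over candidate rows, one per row of the 2-D array
    cdef Py_ssize_t i, j
    for i in range(rows.shape[0]):
        for j in range(rows.shape[1]):
            if rows[i, j] != s[j]:
                break              # short-circuit: this row can't match
        else:
            return i               # every column matched
    return -1                      # no match

You'd call it with something like carryover_fast(s[cols].to_numpy(float), df.loc[ids, cols].to_numpy(float)), which keeps the short-circuiting behaviour of your original loop without the per-row pandas overhead.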
Upvotes: 2