Reputation: 8335
I have some code that runs about 10x faster if I avoid multiple assignment and instead assign each value on its own line, e.g.
fast:
onset = pitch_df.loc[idx, 'onset_time']
dur = pitch_df.loc[idx, 'duration']
slow:
onset, dur = pitch_df.loc[idx, ['onset_time', 'duration']]
Is there an obvious reason for this, or is there a more 'pandas' way of doing what I'm doing? I'd like to assign to local variables here to make my code more readable (i.e. I'd prefer not to write .loc[...] all over the place).
Here's a minimal working example (4x speedup here):
import pandas as pd
import numpy as np
from timeit import timeit
df = pd.DataFrame(
    {'onset_time': [0, 0, 1, 2, 3, 4],
     'pitch': [61, 60, 60, 61, 60, 60],
     'duration': [4, 1, 1, 0.5, 0.5, 2]}
).sort_values(['onset_time', 'pitch']).reset_index(drop=True)
def foo():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset = pitch_df.loc[idx, 'onset_time']
            dur = pitch_df.loc[idx, 'duration']
            note_off = onset + dur
def bar():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset, dur = pitch_df.loc[idx, ['onset_time', 'duration']]
            note_off = onset + dur
print(f'foo time: {timeit(foo, number=100)}')
print(f'bar time: {timeit(bar, number=100)}')
Image included below for easy reading.
Upvotes: 1
Views: 182
Reputation: 4138
As Poolka mentioned in a comment on your question, .at has less overhead than .loc when you want scalar access. I'm no Python expert, but here is a solution that may work for you:
def foo2():
    for pitch, pitch_df in df.groupby('pitch'):
        for iloc in range(len(pitch_df)):
            idx = pitch_df.index[iloc]
            onset, dur = (pitch_df.at[idx, x] for x in ('onset_time', 'duration'))
            note_off = onset + dur
foo time: 0.12590176300000167
bar time: 0.47044453300077294
foo2 time: 0.12269815599938738
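As for the likely reason bar is slower: passing a list of column labels to .loc makes pandas build a new Series for that row on every iteration, while a single label (or .at) hands back a bare scalar. The sketch below checks that on the question's df. It also shows one more option, itertuples, which may read more cleanly than repeated .loc or .at calls. The function name baz and the note_offs list are my own additions for illustration, and I have not timed this version, so treat it as a starting point rather than a benchmark result.

```python
import pandas as pd

df = pd.DataFrame(
    {'onset_time': [0, 0, 1, 2, 3, 4],
     'pitch': [61, 60, 60, 61, 60, 60],
     'duration': [4, 1, 1, 0.5, 0.5, 2]}
).sort_values(['onset_time', 'pitch']).reset_index(drop=True)

# A list of column labels makes .loc construct a Series for the row.
row = df.loc[0, ['onset_time', 'duration']]
row_is_series = isinstance(row, pd.Series)

# A single label with .at returns a plain scalar, with no Series built.
onset = df.at[0, 'onset_time']
onset_is_series = isinstance(onset, pd.Series)


def baz():
    """Hypothetical variant of foo: iterate rows as namedtuples."""
    note_offs = []
    for pitch, pitch_df in df.groupby('pitch'):
        # index=False drops the index from each tuple; columns become attributes.
        for note in pitch_df.itertuples(index=False):
            note_offs.append(note.onset_time + note.duration)
    return note_offs
```

If you only need note_off for every row and not the per-group loop, df['onset_time'] + df['duration'] computes the whole column in one vectorized step.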
Upvotes: 1