Reputation: 482
Say I have a pandas dataframe as follows
df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')
Ouput:
val category
idx
9 30 a
8 40 a
7 50 b
6 60 b
5 70 c
4 80 c
3 90 c
I would like to add a new column with the difference between in each 'val' and the last 'val' of the previous category. The new column should look like this:
category diff val
idx
9 a nan 30
8 a nan 40
7 b 10 50
6 b 20 60
5 c 10 70
4 c 20 80
3 c 30 90
Currently I do something like this:
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df = df.merge(temp_df, on='date', how='outer', right_index=True)
df['diff'] = df['val'] - df['lastVal']
However it is quite slow. Is there a better way to do this?
Upvotes: 1
Views: 705
Reputation: 402363
You can offload the mapping to pandas by setting category as the index first:
df2 = df.set_index('category')
df['diff'] = (
df2['val'] - df.groupby('category')['val'].last().shift()).to_numpy()
df
val category diff
idx
9 30 a NaN
8 40 a NaN
7 50 b 10.0
6 60 b 20.0
5 70 c 10.0
4 80 c 20.0
3 90 c 30.0
Upvotes: 5
Reputation: 2579
This is about twice the speed:
%%timeit
maxdf = df.groupby('category')['val'].last().shift()
df['diff'] = df['val'] - df['category'].map(maxdf.to_dict())
1.33 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
vs your version
%%timeit
temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df2 = df.merge(temp_df, on='category', how='outer', right_index=True)
df2['diff'] = df2['val'] - df2['lastVal']
2.79 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 4