user3607022
user3607022

Reputation: 482

Pandas calculate the diff between all values in one group and the last value of the previous group

Say I have a pandas dataframe as follows

df = pd.DataFrame({'val': [30, 40, 50, 60, 70, 80, 90], 'idx': [9, 8, 7, 6, 5, 4, 3],
                     'category': ['a', 'a', 'b', 'b', 'c', 'c', 'c']}).set_index('idx')

Ouput:

      val category
idx              
9     30        a
8     40        a
7     50        b
6     60        b
5     70        c
4     80        c
3     90        c

I would like to add a new column with the difference between in each 'val' and the last 'val' of the previous category. The new column should look like this:

     category    diff  val
idx                       
9          a     nan     30
8          a     nan     40
7          b      10     50
6          b      20     60
5          c      10     70
4          c      20     80
3          c      30     90

Currently I do something like this:

temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df = df.merge(temp_df, on='date', how='outer', right_index=True)
df['diff'] = df['val'] - df['lastVal']

However it is quite slow. Is there a better way to do this?

Upvotes: 1

Views: 705

Answers (2)

cs95
cs95

Reputation: 402363

You can offload the mapping to pandas by setting category as the index first:

df2 = df.set_index('category')
df['diff'] = (
    df2['val'] - df.groupby('category')['val'].last().shift()).to_numpy()

df

     val category  diff
idx                    
9     30        a   NaN
8     40        a   NaN
7     50        b  10.0
6     60        b  20.0
5     70        c  10.0
4     80        c  20.0
3     90        c  30.0

Upvotes: 5

Ben Pap
Ben Pap

Reputation: 2579

This is about twice the speed:

%%timeit

maxdf = df.groupby('category')['val'].last().shift()
df['diff'] = df['val'] - df['category'].map(maxdf.to_dict())

1.33 ms ± 20.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

vs your version

%%timeit

temp_df = df.groupby('category')['val'].agg('last').rename('lastVal').shift()
df2 = df.merge(temp_df, on='category', how='outer', right_index=True)
df2['diff'] = df2['val'] - df2['lastVal']

2.79 ms ± 83.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Upvotes: 4

Related Questions