roedaroeven
roedaroeven

Reputation: 13

Pandas: difference between groups

Hi I have a panda df that looks like the following (not real data)

index   datedjourney   sequence   values
1       1              1          120
2       1              1          100
3       1              2          75
4       1              3          50
5       1              3          30
6       1              3          61
7       1              4          40
8       1              4          50 
9       2              1          86
10      2              1          40
11      2              2          70

If I group by datedjourney and sequence like this

df.groupby(['datedjourney', 'sequence'])['values']

I want to calculate the difference between the last row in a grouping and the first row in the second group so the df would look like the following

index      datedjourney   sequence   values    diff_between_groups
1           1              1          120          NaN
2           1              1          100          NaN
3           1              2          75           -25
4           1              3          50           -25
5           1              3          30           NaN
6           1              3          61           NaN
7           1              4          40           -21
8           1              4          50           NaN
9           2              1          86           NaN
10          2              1          40           NaN
11          2              2          70           30.

So the values on row 2 and row 3 should be calculated and the difference should be stored in a new column diff_between_groups on row 3. And then the difference between row 3 and 4 etc. And I don't want to calculate the difference between row 8 and 9 because that's a new datedjourney.

Upvotes: 1

Views: 466

Answers (1)

b2002
b2002

Reputation: 914

Create a "diff_between_groups" column which is the difference between "values" and "values" shifted one row.

Make a boolean mask to find the rows where "datedjourney" is the same and "sequence" is different from the row above.

Use pandas Series where function to replace "diff_between_groups" values using mask.

df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'datedjourney': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
                   'sequence': [1, 1, 2, 3, 3, 3, 4, 4, 1, 1, 2],
                   'values': [120, 100, 75, 50, 30, 61, 40, 50, 86, 40, 70]})

df['diff_between_groups'] = df['values'] - df['values'].shift()
mask = (df.datedjourney == df.datedjourney.shift()) & (df.sequence != df.sequence.shift())
df['diff_between_groups'] = df['diff_between_groups'].where(mask, np.nan)

print(df)

    index  datedjourney  sequence  values  diff_between_groups
0       1             1         1     120                  NaN
1       2             1         1     100                  NaN
2       3             1         2      75                -25.0
3       4             1         3      50                -25.0
4       5             1         3      30                  NaN
5       6             1         3      61                  NaN
6       7             1         4      40                -21.0
7       8             1         4      50                  NaN
8       9             2         1      86                  NaN
9      10             2         1      40                  NaN
10     11             2         2      70                 30.0

Upvotes: 2

Related Questions