Reputation: 13
Hi I have a panda df that looks like the following (not real data)
index datedjourney sequence values
1 1 1 120
2 1 1 100
3 1 2 75
4 1 3 50
5 1 3 30
6 1 3 61
7 1 4 40
8 1 4 50
9 2 1 86
10 2 1 40
11 2 2 70
If I group by datedjourney and sequence like this
df.groupby(['datedjourney', 'sequence'])['values']
I want to calculate the difference between the last row in a grouping and the first row in the second group so the df
would look like the following
index datedjourney sequence values diff_between_groups
1 1 1 120 NaN
2 1 1 100 NaN
3 1 2 75 -25
4 1 3 50 -25
5 1 3 30 NaN
6 1 3 61 NaN
7 1 4 40 -21
8 1 4 50 NaN
9 2 1 86 NaN
10 2 1 40 NaN
11 2 2 70 30.
So the values on row 2 and row 3 should be calculated and the difference should be stored in a new column diff_between_groups
on row 3. And then the difference between row 3 and 4 etc. And I don't want to calculate the difference between row 8 and 9 because that's a new datedjourney
.
Upvotes: 1
Views: 466
Reputation: 914
Create a "diff_between_groups" column which is the difference between "values" and "values" shifted one row.
Make a boolean mask to find the rows where "datedjourney" is the same and "sequence" is different from the row above.
Use pandas Series where function to replace "diff_between_groups" values using mask.
df = pd.DataFrame({'index': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
'datedjourney': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2],
'sequence': [1, 1, 2, 3, 3, 3, 4, 4, 1, 1, 2],
'values': [120, 100, 75, 50, 30, 61, 40, 50, 86, 40, 70]})
df['diff_between_groups'] = df['values'] - df['values'].shift()
mask = (df.datedjourney == df.datedjourney.shift()) & (df.sequence != df.sequence.shift())
df['diff_between_groups'] = df['diff_between_groups'].where(mask, np.nan)
print(df)
index datedjourney sequence values diff_between_groups
0 1 1 1 120 NaN
1 2 1 1 100 NaN
2 3 1 2 75 -25.0
3 4 1 3 50 -25.0
4 5 1 3 30 NaN
5 6 1 3 61 NaN
6 7 1 4 40 -21.0
7 8 1 4 50 NaN
8 9 2 1 86 NaN
9 10 2 1 40 NaN
10 11 2 2 70 30.0
Upvotes: 2