Reputation: 41
I have a dataset with several repeated measures over a week for each participant, which I aggregated using `groupby` and `.agg` functions to have the first and the last measurement taken as follows:
df2=df.groupby("ID").agg(['first', 'last']).stack()
Now it looks like this:
| ID | | Age | Score | Measure |
| -- | ---- | --- | ----- | ------ |
| 0 | first| 69 | 48 | 34 |
| | last | 69 | 52 | 33 |
| 1 | first| 58 | 28 | 89 |
| | last | 58 | 14 | 34 |
I want to create a dataset using first and last value aggregates to create a difference between these values for Score and Measure variables. It should look like this:
| ID | Age | Score_diff | Measure_diff |
| -- | --- | ---------- | ------------ |
| 0 | 69 | 4 | 1 |
| 1 | 58 | 14 | 55 |
I am not sure how to approach this (I'm a Python newbie). Is there anyone who could help me out?
Best wishes
Upvotes: 2
Views: 82
Reputation: 24314
import pandas as pd
import numpy as np
you can also do this by:
df=df.reset_index()
result=(df.groupby(['ID','Age'])
.agg(lambda x:abs(np.diff(x)))
.rename(columns=lambda x:x+'_diff')
.reset_index())
Now If you print result
you will get:
ID Age Score_diff Measure_diff
0 69 4 1
1 58 14 55
Upvotes: 2
Reputation: 1575
You could try the following:
import pandas as pd
df2 = df.groupby("id").agg(['first', 'last']).stack().reset_index().set_index("id")
>>> df2
level_1 Age Score Measure
id
0 first 69 48 34
0 last 69 52 33
1 first 58 28 89
1 last 58 14 34
You groupby("id")
so that you apply the following functions to each one of them and agg
them to a list.
df3 = df2.groupby("id").agg(list)
df3["Age"] = df3["Age"].apply(lambda x: x[0]) # x[0] and x[1] are equal
>>> df3
level_1 Age Score Measure
id
0 [first, last] 69 [48, 52] [34, 36]
1 [first, last] 58 [28, 14] [89, 34]
Then you apply a lambda
function to get the difference between the two values.
df3["Score_diff"] = df3["Score"].apply(lambda x: abs(x[1] - x[0]))
df3["Measure_diff"] = df3["Measure"].apply(lambda x: abs(x[1] - x[0]))
And delete the columns you don't want
df3 = df3.drop(columns=["level_1", "Score", "Measure"])
>>> df3
Age Score_diff Measure_diff
id
0 69 4 2
1 58 14 55
Upvotes: 2