Reputation: 93
I have a timeseries dataset containing scores on scales of depression, anxiety, and trauma for patients. Data was collected at 6 time points for each patient.
mh_data.head(10)
ID BDI GAD TSQ age
1 57 9 4 22
1 36 9 4 22
1 37 9 4 22
1 38 7 3 22
1 41 8 3 22
1 39 7 3 22
2 29 14 7 35
2 27 12 6 35
2 27 11 6 35
2 23 11 3 35
I want to create a new dataset where each patient has only 1 corresponding value for each of the variables which represents the difference between the first last and recorded data point. So, it will look like this:
ID BDI GAD TSQ age
1 18 2 1 22
2 1 0 2 35
. . . . .
. . . . .
. . . . .
I've grouped the data and aggregated by first and last scores:
mhs_agg = mhs_data.groupby("ID").agg(['first','last'])
How can I proceed or is there a more efficient way of doing this? I also have age which is a variable I don't want to be computing the difference for (as this will come out as 0 for everyone).
I've seen all of the following posts and none of the suggestions seem to work for my specific case.
How to apply "first" and "last" functions to columns while using group by in pandas?
Python/Pandas - Aggregating dataframe with first/last function without grouping
Pandas DataFrame groupby two columns and get first and last
Upvotes: 2
Views: 1919
Reputation: 8768
This should work:
df.groupby('ID').agg(lambda x: x.iloc[0] - x.iloc[-1])
Upvotes: 2
Reputation: 102
You can pass arbitrary functions into .agg()
. When you do this, your function receives each pandas series 1 at a time, already filtered down to an individual "group".
I'd probably do the whole calculation as one function instead of multiple separate steps.
import pandas as pd
my_data = pd.DataFrame(
{'id': [1, 1, 1, 2, 2, 2],
'b': [0, 2, 5, 2, 4, 8],
'c': [5, 8, 1, 4, 2, 1]})
def diff_calc(x):
x1 = x[:1].values # Get first value
x2 = x[-1:].values # Get last value
return abs(x1 - x2)
my_data.groupby('id').agg(diff_calc)
id | b | c |
---|---|---|
1 | 0 | 5 |
1 | 2 | 8 |
1 | 5 | 1 |
2 | 2 | 4 |
2 | 4 | 2 |
2 | 8 | 1 |
id | b | c |
---|---|---|
1 | 5 | 4 |
2 | 6 | 3 |
ps: You may want to use .reset_index()
at the end as pandas groupBy is a bit weird, and that may have been the original issue you were running into.
Upvotes: 3
Reputation: 5648
try:
df1 = df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('first')-df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('last')
df_final = df1.merge(df[['ID','age']].groupby('ID').agg('first'), on='ID')
BDI GAD TSQ age
ID
1 18 2 1 22
2 6 3 4 35
Second option using lambda to get the first part, then merge
df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID', as_index=False).apply(lambda x: x.groupby('ID').agg('first')-x.groupby('ID').agg('last'))
Upvotes: 3