ummendial
ummendial

Reputation: 93

How to subtract first and last values in grouped data for all columns in dataset using pandas

I have a timeseries dataset containing scores on scales of depression, anxiety, and trauma for patients. Data was collected at 6 time points for each patient.

mh_data.head(10)
ID    BDI   GAD   TSQ  age
1     57     9     4    22
1     36     9     4    22
1     37     9     4    22
1     38     7     3    22
1     41     8     3    22
1     39     7     3    22
2     29     14    7    35
2     27     12    6    35
2     27     11    6    35
2     23     11    3    35

I want to create a new dataset where each patient has only 1 corresponding value for each of the variables which represents the difference between the first last and recorded data point. So, it will look like this:

ID    BDI   GAD   TSQ  age
1     18     2     1    22
2     1      0     2    35
.     .      .     .    .
.     .      .     .    .
.     .      .     .    .

I've grouped the data and aggregated by first and last scores:

mhs_agg = mhs_data.groupby("ID").agg(['first','last']) 

How can I proceed or is there a more efficient way of doing this? I also have age which is a variable I don't want to be computing the difference for (as this will come out as 0 for everyone).

I've seen all of the following posts and none of the suggestions seem to work for my specific case.

How to apply "first" and "last" functions to columns while using group by in pandas?

Python/Pandas - Aggregating dataframe with first/last function without grouping

Pandas DataFrame groupby two columns and get first and last

Upvotes: 2

Views: 1919

Answers (3)

rhug123
rhug123

Reputation: 8768

This should work:

df.groupby('ID').agg(lambda x: x.iloc[0] - x.iloc[-1])

Upvotes: 2

Tyler Rosacker
Tyler Rosacker

Reputation: 102

You can pass arbitrary functions into .agg(). When you do this, your function receives each pandas series 1 at a time, already filtered down to an individual "group".

I'd probably do the whole calculation as one function instead of multiple separate steps.

import pandas as pd

my_data = pd.DataFrame(
    {'id': [1, 1, 1, 2, 2, 2], 
     'b': [0, 2, 5, 2, 4, 8],
     'c': [5, 8, 1, 4, 2, 1]})

def diff_calc(x):
    x1 = x[:1].values # Get first value
    x2 = x[-1:].values # Get last value
    
    return abs(x1 - x2)

my_data.groupby('id').agg(diff_calc)

Input

id b c
1 0 5
1 2 8
1 5 1
2 2 4
2 4 2
2 8 1

Output

id b c
1 5 4
2 6 3

ps: You may want to use .reset_index() at the end as pandas groupBy is a bit weird, and that may have been the original issue you were running into.

Upvotes: 3

Jonathan Leon
Jonathan Leon

Reputation: 5648

try:

df1 = df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('first')-df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID').agg('last')
df_final = df1.merge(df[['ID','age']].groupby('ID').agg('first'), on='ID')


    BDI  GAD  TSQ  age
ID
1    18    2    1   22
2     6    3    4   35

Second option using lambda to get the first part, then merge

df[['ID','BDI', 'GAD', 'TSQ']].groupby('ID', as_index=False).apply(lambda x: x.groupby('ID').agg('first')-x.groupby('ID').agg('last'))

Upvotes: 3

Related Questions