Ufos

Reputation: 3305

sklearn train_test_split by subject (participant) for a longitudinal (panel) study

I have a dataset with multiple subjects observed over time. I will be training a sequential model on it, and I need to split it into train and test sets by subject (study participant) so that no subject appears in both.

I provided my current workaround as an answer.


Example dataset:

from pydataset import data
longitudinal_study = data('Blackmoor')
longitudinal_study.head(10)

   subject    age  exercise    group
1      100   8.00      2.71  patient
2      100  10.00      1.94  patient
3      100  12.00      2.36  patient
4      100  14.00      1.54  patient
5      100  15.92      8.63  patient
6      101   8.00      0.14  patient
7      101  10.00      0.14  patient
8      101  12.00      0.00  patient
9      101  14.00      0.00  patient
10     101  16.67      5.08  patient

Expected output:

# Not Implemented
# train_df, test_df = train_test_split(longitudinal_study, by='subject', test_size=0.1)
assert len(set(train_df.subject).intersection(set(test_df.subject)))==0

I have three questions:


Upvotes: 1

Views: 1140

Answers (2)

Arturo Sbr

Reputation: 6333

To complement your solution: rather than splitting the unique set of subjects, it may be better to keep the last observation of each subject and stratify the split on your target (or on a feature).

Both solutions yield essentially the same kind of split, but stratifying on each subject's last observed period may matter if your data becomes imbalanced over time.

from sklearn.model_selection import train_test_split

# Keep the last row of each subject (one row per subject)
subjects = longitudinal_study.groupby('subject').last().reset_index()
# Split the subject IDs, stratifying by `group`
subjects_train, subjects_test = train_test_split(
    subjects['subject'], test_size=0.1, stratify=subjects['group']
)

Then continue as in your answer: select the rows of each split with isin on the subject column.
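For completeness, here is a minimal end-to-end sketch of the stratified variant. It uses a small made-up DataFrame (`toy`, with hypothetical values) instead of the Blackmoor data so that it runs on its own; `random_state=0` and `test_size=0.2` are arbitrary choices for the example, not recommendations.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy panel data (hypothetical values): 10 subjects, 3 visits each, 2 groups
toy = pd.DataFrame({
    "subject": [s for s in range(100, 110) for _ in range(3)],
    "age": [8.0, 10.0, 12.0] * 10,
    "group": [g for g in ["patient"] * 5 + ["control"] * 5 for _ in range(3)],
})

# One row per subject: the last observation of each
last_obs = toy.groupby("subject").last().reset_index()

# Split the subject IDs, stratifying on each subject's group
subjects_train, subjects_test = train_test_split(
    last_obs["subject"],
    test_size=0.2,
    stratify=last_obs["group"],
    random_state=0,
)

# Select all rows belonging to the subjects in each split
train_df = toy[toy["subject"].isin(subjects_train)]
test_df = toy[toy["subject"].isin(subjects_test)]
```

With stratification, both groups should be represented in the test subjects, and no subject ends up on both sides of the split.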

Check this article in case you want to stratify by a continuous column.

Upvotes: 1

Ufos

Reputation: 3305

As a workaround, one can apply the standard train_test_split to the unique values of the subject column and then filter the rows by the resulting subject IDs.

import pandas as pd
from sklearn.model_selection import train_test_split

# Split the unique subject IDs rather than the rows
subjects = longitudinal_study.subject.unique()
subjects_train, subjects_test = train_test_split(subjects, test_size=0.1)
# Keep every row of a subject on the same side of the split
train_df = longitudinal_study[longitudinal_study.subject.isin(subjects_train)]
test_df = longitudinal_study[longitudinal_study.subject.isin(subjects_test)]
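A possible alternative, not part of the workaround above: scikit-learn also provides GroupShuffleSplit, which splits by group label directly. The sketch below assumes a small made-up DataFrame (`toy`, hypothetical values) in place of the Blackmoor data; `test_size=0.2` and `random_state=0` are arbitrary. Note that here `test_size` is a fraction of the groups (subjects), not of the rows.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy panel data (hypothetical values): 10 subjects, 3 visits each
toy = pd.DataFrame({
    "subject": [s for s in range(100, 110) for _ in range(3)],
    "age": [8.0, 10.0, 12.0] * 10,
})

# A single shuffle split in which each subject stays on one side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["subject"]))

# split() yields positional indices, so use iloc
train_df = toy.iloc[train_idx]
test_df = toy.iloc[test_idx]
```

The assertion from the question should hold for the resulting train_df and test_df.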

Upvotes: 1
