Reputation: 3305
I have a dataset with multiple subjects observed over time. I will be training a sequential model on it, and I need to split it into train/test by subject (study participant).
I provided my current workaround as an answer.
Example dataset:
from pydataset import data
longitudinal_study = data('Blackmoor')
longitudinal_study.head(10)
    subject    age  exercise    group
1       100   8.00      2.71  patient
2       100  10.00      1.94  patient
3       100  12.00      2.36  patient
4       100  14.00      1.54  patient
5       100  15.92      8.63  patient
6       101   8.00      0.14  patient
7       101  10.00      0.14  patient
8       101  12.00      0.00  patient
9       101  14.00      0.00  patient
10      101  16.67      5.08  patient
Expected output:
# Not Implemented
# train_df, test_df = train_test_split(longitudinal_study, by='subject', test_size=0.1)
assert len(set(train_df.subject).intersection(set(test_df.subject)))==0
I have three questions:
[...] test_size? What if the number of observations differs between participants?
[...] scikit-learn or other libraries?
Upvotes: 1
Views: 1140
Reputation: 6333
I would like to complement your solution: rather than splitting a unique set of subjects, it may be better to keep the last observation of each subject and stratify on your target (or even a feature).
Both solutions yield essentially the same kind of split, with no subject appearing in both sets, but stratifying on the last observed period of each subject may be important if your data becomes unbalanced with the passing of time.
from sklearn.model_selection import train_test_split
# Keep the last row of each subject (one row per subject)
subjects = longitudinal_study.groupby('subject').last().reset_index()
# Split the subject IDs, stratifying by `group`
subjects_train, subjects_test = train_test_split(
    subjects['subject'], train_size=0.9, test_size=0.1, stratify=subjects['group'])
And then continue as before.
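To make "continue as before" concrete, here is a minimal end-to-end sketch. The toy frame, the 0.2 test fraction, and the `random_state` value are assumptions made up for illustration; they are not taken from the Blackmoor data. It keeps the last row per subject, splits the subject IDs stratified by `group`, and then selects rows with `isin`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy longitudinal data (made up): 10 subjects, 3 observations each, two groups
toy = pd.DataFrame({
    "subject": [s for s in range(100, 110) for _ in range(3)],
    "age": [8.0, 10.0, 12.0] * 10,
    "group": [g for g in ["patient"] * 5 + ["control"] * 5 for _ in range(3)],
})

# One row per subject: the last observation of each
last_obs = toy.groupby("subject").last().reset_index()

# Split the subject IDs, keeping the group proportions similar in both sets
subjects_train, subjects_test = train_test_split(
    last_obs["subject"],
    test_size=0.2,
    stratify=last_obs["group"],
    random_state=0,  # assumed seed, only for reproducibility
)

# Map the subject-level split back to the full set of rows
train_df = toy[toy["subject"].isin(subjects_train)]
test_df = toy[toy["subject"].isin(subjects_test)]
```

Because the split is done on one row per subject, every subject's rows land entirely in one of the two frames.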
Check this article in case you want to stratify by a continuous column.
Upvotes: 1
Reputation: 3305
As a workaround, one can use the standard train_test_split on the unique values of the subject column.
import pandas as pd
from sklearn.model_selection import train_test_split
# Split the unique subject IDs rather than the rows
subjects = longitudinal_study.subject.unique()
subjects_train, subjects_test = train_test_split(subjects, test_size=0.1)
# Select all rows belonging to each set of subjects
train_df = longitudinal_study[longitudinal_study.subject.isin(subjects_train)]
test_df = longitudinal_study[longitudinal_study.subject.isin(subjects_test)]
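On the scikit-learn question: `GroupShuffleSplit` in `sklearn.model_selection` splits by group labels, so it may be able to replace the manual `isin` filtering. Below is a minimal sketch on made-up data (the toy frame and `random_state` are assumptions for illustration). Note that a float `test_size` here is a fraction of subjects, not of rows, so the share of rows that ends up in the test set can differ from `test_size` when subjects have different numbers of observations.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data (made up): subject 100+i has i+1 observations, so counts differ per subject
rows = [(100 + i, float(t)) for i in range(10) for t in range(i + 1)]
toy = pd.DataFrame(rows, columns=["subject", "age"])

# test_size=0.2 means 20% of the subjects (groups), not 20% of the rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(toy, groups=toy["subject"]))

# split() yields positional indices, hence iloc
train_df = toy.iloc[train_idx]
test_df = toy.iloc[test_idx]

# Compare the subject-level fraction with the resulting row-level fraction
n_test_subjects = test_df["subject"].nunique()
row_share = len(test_df) / len(toy)
print(n_test_subjects, round(row_share, 3))
```

If the row-level share matters more than the subject-level share, it is worth checking `row_share` after the split and trying another seed or `test_size`.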
Upvotes: 1