Zee

Reputation: 91

Train Test Split sklearn based on group variable

My X looks like this:

Unique ID    Exp start date    Value    Status
001          01/01/2020        4000     Closed
001          12/01/2019        4000     Archived
002          01/01/2020        5000     Closed
002          12/01/2019        5000     Archived

I want to make sure that no unique ID that appears in the training set also appears in the test set. I am using sklearn's train_test_split. Is this possible?

Upvotes: 4

Views: 4443

Answers (1)

seralouk

Reputation: 33147

I believe you need GroupShuffleSplit from sklearn.model_selection (see the scikit-learn documentation).

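import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Dummy data: 8 samples, each assigned to one of 3 groups
X = np.ones(shape=(8, 2))
y = np.ones(shape=(8, 1))
groups = np.array([1, 1, 2, 2, 2, 3, 3, 3])

# Two random splits; roughly 70% of the groups (not rows) go to training
gss = GroupShuffleSplit(n_splits=2, train_size=.7, random_state=42)

# groups decides which rows must stay together on one side of the split
for train_idx, test_idx in gss.split(X, y, groups):
    print("TRAIN:", train_idx, "TEST:", test_idx)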

TRAIN: [2 3 4 5 6 7] TEST: [0 1]
TRAIN: [0 1 5 6 7] TEST: [2 3 4]

As the output shows, the train/test indices follow the groups variable: all rows of a given group land on the same side of the split, so no group appears in both train and test.

In your case, pass the Unique ID column as groups.
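Applied to a DataFrame, that could look roughly like the sketch below. It assumes the data is in pandas; the column names (unique_id, exp_start_date, value, status) and test_size=0.5 are my own choices for illustration, not something taken from your code.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# The four rows from the question; column names are assumed for illustration.
df = pd.DataFrame({
    "unique_id": ["001", "001", "002", "002"],
    "exp_start_date": ["01/01/2020", "12/01/2019", "01/01/2020", "12/01/2019"],
    "value": [4000, 4000, 5000, 5000],
    "status": ["Closed", "Archived", "Closed", "Archived"],
})

# One split; test_size is a fraction of the groups (IDs), not of the rows.
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["unique_id"]))

# split() yields positional indices, so use iloc rather than loc.
train_df = df.iloc[train_idx]
test_df = df.iloc[test_idx]

# Sanity check: no ID is shared between the two sets.
overlap = set(train_df["unique_id"]) & set(test_df["unique_id"])
assert not overlap
```

Because the split is made on groups, the resulting row proportions only approximate the requested size when the groups differ in length.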

Upvotes: 4
