I need to do K-fold CV on some models, but I need to ensure that each validation (test) set is clustered together by group and spans t consecutive years. GroupKFold is close, but it still splits up the validation set (see the second fold).
For example, suppose I have data with years from 2000 to 2008 and I want to K-fold it into 3 groups. The appropriate sets would be: Validation: 2000-2002, Train: 2003-2008; Validation: 2003-2005, Train: 2000-2002 & 2006-2008; and Validation: 2006-2008, Train: 2000-2005.
Is there a way to group and cluster the data using K-fold CV so that each validation set is clustered by t years?
from sklearn.model_selection import GroupKFold
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
# n_splits=3 matches the three folds shown in the output below
gkf = GroupKFold(n_splits=3)
for train_index, test_index in gkf.split(X, y, groups=groups):
    print("Train:", train_index, "Validation:", test_index)
Output:
Train: [ 0 1 2 3 4 5 10 11 12] Validation: [6 7 8 9]
Train: [3 4 5 6 7 8 9] Validation: [ 0 1 2 10 11 12]
Train: [ 0 1 2 6 7 8 9 10 11 12] Validation: [3 4 5]
Desired Output (assume 2 years for each group):
Train: [ 7 8 9 10 11 12 ] Validation: [0 1 2 3 4 5 6]
Train: [0 1 2 10 11 12 ] Validation: [ 3 4 5 6 7 8 9 ]
Train: [ 0 1 2 3 4 5 ] Validation: [6 7 8 9 10 11 12]
Note that the train and test subsets do not have to be sequential, and more years can be selected per group.
Upvotes: 5
Views: 4745
I hope I understood you correctly.
The LeaveOneGroupOut class from scikit-learn's model_selection module might help:
Let's say you assign the group label 0 to all data points from 2000-2002, label 1 to all data points from 2003-2005, and label 2 to the data from 2006-2008. You can then create training and test splits in which each of the three test sets consists of exactly one group:
from sklearn.model_selection import LeaveOneGroupOut
import numpy as np
groups = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3]
X = np.random.random(len(groups))
y = np.random.randint(0, 4, len(groups))
logo = LeaveOneGroupOut()
print("n_splits=", logo.get_n_splits(X, y, groups))
for train_index, test_index in logo.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 4 5 6 7 8 9 10 11 12 13 14 15 16 17] test_idx: [0 1 2 3]
train_idx: [ 0 1 2 3 10 11 12 13 14 15 16 17] test_idx: [4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5 6 7 8 9] test_idx: [10 11 12 13 14 15 16 17]
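If the year of each sample is available, the group labels do not have to be typed by hand. Here is a minimal sketch, assuming one sample per year from 2000 to 2008 and blocks of three years; the names years and block_size are only for illustration and do not come from the question:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical data: one sample per year from 2000 to 2008.
years = np.arange(2000, 2009)
block_size = 3  # assumed number of years per validation block

# Integer division maps 2000-2002 -> 0, 2003-2005 -> 1, 2006-2008 -> 2.
groups = (years - years.min()) // block_size

X = np.zeros((len(years), 1))  # placeholder features, only the length matters here

# Collect the years that end up in each train/validation split.
folds = [
    (years[train_idx].tolist(), years[test_idx].tolist())
    for train_idx, test_idx in LeaveOneGroupOut().split(X, groups=groups)
]
for train_years, test_years in folds:
    print("Validation:", test_years, "Train:", train_years)
```

With real data, years would be the year column of your data set, and several rows may share the same year.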
I think I now finally understood what you want. Sorry that it took me so long.
I don't think that your desired split method is already implemented in sklearn, but we can easily extend the BaseCrossValidator class.
import numpy as np
from sklearn.model_selection import BaseCrossValidator
from sklearn.utils.validation import check_array

class GroupOfGroups(BaseCrossValidator):
    def __init__(self, group_of_groups):
        """
        :param group_of_groups: list with length n_splits. Each entry in the list is a list with group ids from
        set(groups). In each of the n_splits splits, the groups given in the current group_of_groups sublist are used
        for validation.
        """
        self.group_of_groups = group_of_groups

    def get_n_splits(self, X=None, y=None, groups=None):
        # One split per sublist of validation groups
        return len(self.group_of_groups)

    def _iter_test_masks(self, X=None, y=None, groups=None):
        if groups is None:
            raise ValueError("The 'groups' parameter should not be None.")
        groups = check_array(groups, copy=True, ensure_2d=False, dtype=None)
        for g in self.group_of_groups:
            # Boolean mask over all samples; np.bool was removed from NumPy, so use the builtin bool
            test_mask = np.zeros(len(groups), dtype=bool)
            for g_id in g:
                # Mark every sample that belongs to one of the validation groups
                test_mask[groups == g_id] = True
            yield test_mask
The usage is quite simple. As before, we define X, y and groups. Additionally, we define a list of lists (groups of groups) that specifies which groups are used together in which test fold.
So g_of_g = [[1,2],[2,3],[3,4]] means that groups 1 and 2 are used as the test set in the first fold, while the remaining groups 3 and 4 are used for training. In fold 2, data from groups 2 and 3 are used as the test set, and so on.
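For more groups, writing the list of lists by hand gets tedious. As a rough sketch, a sliding window over the sorted group ids produces the same structure; sliding_group_windows, window and step are hypothetical names, not part of scikit-learn:

```python
def sliding_group_windows(group_ids, window, step=1):
    """Return lists of `window` consecutive group ids, advancing by `step` each time."""
    ordered = sorted(set(group_ids))
    last_start = len(ordered) - window
    return [ordered[i:i + window] for i in range(0, last_start + 1, step)]

groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]

# Overlapping windows of two groups
g_of_g = sliding_group_windows(groups, window=2)
print(g_of_g)  # [[1, 2], [2, 3], [3, 4]]

# Non-overlapping windows of two groups
g_of_g_disjoint = sliding_group_windows(groups, window=2, step=2)
print(g_of_g_disjoint)  # [[1, 2], [3, 4]]
```

This assumes that sorting the group ids puts them in chronological order, which holds if the ids are years or increasing integers.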
I am not quite happy with the name "GroupOfGroups", so maybe you can find something better.
Now we can test this cross validator:
X = [0.1, 0.2, 2.2, 2.4, 2.3, 4.55, 5.8, 8.8, 9, 10, 0.1, 0.2, 2.2]
y = ["a", "b", "b", "b", "c", "c", "c", "d", "d", "d", "a", "b", "b"]
groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1, 2], [2, 3], [3, 4]]
gg = GroupOfGroups(g_of_g)
print("n_splits=", gg.get_n_splits(X, y, groups))
for train_index, test_index in gg.split(X, y, groups):
    print("train_idx:", train_index, "test_idx:", test_index)
Output:
n_splits= 3
train_idx: [ 6 7 8 9 10 11 12] test_idx: [0 1 2 3 4 5]
train_idx: [ 0 1 2 10 11 12] test_idx: [3 4 5 6 7 8 9]
train_idx: [0 1 2 3 4 5] test_idx: [ 6 7 8 9 10 11 12]
Please keep in mind that I did not include a lot of checks and didn't do thorough testing. So verify carefully that this works for you.
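If you would rather not subclass anything, a plain generator is a possible alternative, since the cv argument of scikit-learn's model selection tools also accepts an iterable of (train, test) index arrays. A minimal sketch, where grouped_block_splits is a made-up helper name:

```python
import numpy as np

def grouped_block_splits(groups, group_of_groups):
    """Yield (train_idx, test_idx) pairs, one per sublist of validation groups."""
    groups = np.asarray(groups)
    indices = np.arange(len(groups))
    for block in group_of_groups:
        # True for every sample whose group id is in the current validation block
        test_mask = np.isin(groups, block)
        yield indices[~test_mask], indices[test_mask]

groups = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4]
g_of_g = [[1, 2], [2, 3], [3, 4]]

# Materialize the generator so the splits can be reused several times
splits = list(grouped_block_splits(groups, g_of_g))
for train_idx, test_idx in splits:
    print("train_idx:", train_idx, "test_idx:", test_idx)
```

The resulting list can then be passed as cv, for example cross_val_score(model, X, y, cv=splits), where model, X and y are your own estimator and data.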
Upvotes: 7