Kevin

Reputation: 3239

K-fold cross validation - save folds for different models

I am trying to train my models and validate them using sklearn's cross validation. What I want to do is use the same folds across all of my models (which will be running from different python scripts).

How can I do this? Should I save the folds to a file, save the kfold object itself, or just use the same seed in every script?

kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)

Upvotes: 1

Views: 2775

Answers (1)

Kevin

Reputation: 3239

The easiest way I found to save the folds was to loop over the output of the StratifiedKFold split method and store the train/test indices in a JSON file:

import json
import numpy as np
from sklearn.model_selection import StratifiedKFold

# n_splits, seed and the one-hot label array y are defined elsewhere
kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
folds = {}
# split() only uses the labels for stratification, so a dummy X is enough
for count, (train, test) in enumerate(kfold.split(np.zeros(len(y)), y.argmax(1)), start=1):
    folds['fold_{}'.format(count)] = {
        'train': train.tolist(),  # numpy arrays are not JSON serializable
        'test': test.tolist(),
    }
assert len(folds) == n_splits  # one entry per split

# dump folds to json
with open('folds.json', 'w') as fp:
    json.dump(folds, fp)

Note 1: argmax is used here because my y values are one-hot encoded, so we need to recover the ground-truth class index of each sample before stratifying.

Now to load it from any other script:

import json

# load the folds into a dict
with open('folds.json') as f:
    kfolds = json.load(f)
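
Since every script now depends on this file, it may be worth checking it after loading. Below is a minimal sketch of such a check, not part of the original workflow: check_folds is a hypothetical helper, and the toy folds and temporary path stand in for the real folds.json.

```python
import json
import os
import tempfile


def check_folds(folds, n_samples):
    """Return True if train/test never overlap within a fold and the
    test lists together cover every sample index exactly once."""
    seen = []
    for split in folds.values():
        train, test = set(split["train"]), set(split["test"])
        if train & test:
            return False  # an index is in both train and test
        if train | test != set(range(n_samples)):
            return False  # some index is missing from this fold
        seen.extend(split["test"])
    return sorted(seen) == list(range(n_samples))


# Hypothetical toy folds for 6 samples (stand-in for the real file)
toy_folds = {
    "fold_1": {"train": [2, 3, 4, 5], "test": [0, 1]},
    "fold_2": {"train": [0, 1, 4, 5], "test": [2, 3]},
    "fold_3": {"train": [0, 1, 2, 3], "test": [4, 5]},
}

# Round-trip through JSON the same way the saving script does
path = os.path.join(tempfile.mkdtemp(), "folds.json")
with open(path, "w") as fp:
    json.dump(toy_folds, fp)
with open(path) as f:
    loaded = json.load(f)

folds_ok = check_folds(loaded, n_samples=6)
```

If the check fails, the file was probably written for a different dataset or a different number of samples.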

From here we can easily just loop over the elements in the dict:

for key, val in kfolds.items():
    print(key)
    train = val['train']
    test = val['test']
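
The loaded indices are plain Python lists, so they can be used to slice the data inside that loop. A rough sketch, assuming X and y are NumPy arrays with samples along the first axis (the small arrays and the kfolds dict here are made-up stand-ins for the real data and the loaded file):

```python
import numpy as np

# Hypothetical stand-ins for the real data and the dict loaded from folds.json
X = np.arange(12).reshape(6, 2)
y = np.array([0, 1, 0, 1, 0, 1])
kfolds = {
    "fold_1": {"train": [2, 3, 4, 5], "test": [0, 1]},
    "fold_2": {"train": [0, 1, 4, 5], "test": [2, 3]},
    "fold_3": {"train": [0, 1, 2, 3], "test": [4, 5]},
}

fold_shapes = {}
for key, val in kfolds.items():
    # Convert the lists back to integer index arrays
    train_idx = np.asarray(val["train"], dtype=int)
    test_idx = np.asarray(val["test"], dtype=int)

    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

    # Fit and evaluate the model for this fold here
    fold_shapes[key] = (X_train.shape, X_test.shape)
```

Every script that reads the same file gets exactly the same rows in each fold, regardless of which model it trains.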

The JSON file looks like this:

{"fold_1": {"train": [193, 2405, 2895, 565, 1215, 274, 2839, 1735, 2536, 1196, 40, 2541, 980,...SNIP...830, 1032], "test": [1, 5, 6, 7, 10, 15, 20, 26, 37, 45, 52, 54, 55, 59, 60, 64, 65, 68, 74, 76, 78, 90, 100, 106, 107, 113, 122, 124, 132, 135, 141, 146,...SNIP...]}
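
As for the same-seed option from the question: with shuffle=True and an integer random_state, StratifiedKFold should produce the same splits every time, provided each script passes the same labels in the same order. A small sketch to check this, where make_folds and the toy labels are hypothetical. I would still treat the saved file as the safer option, since I am assuming here that all scripts run the same scikit-learn version, and the splits may differ between versions.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def make_folds(labels, n_splits, seed):
    """Build the folds from scratch, as each separate script would."""
    kfold = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    # split() only uses the labels for stratification, so a dummy X is enough
    return [
        (train.tolist(), test.tolist())
        for train, test in kfold.split(np.zeros(len(labels)), labels)
    ]


# Hypothetical labels: 20 samples, two balanced classes
labels = np.array([0, 1] * 10)

folds_a = make_folds(labels, n_splits=5, seed=42)
folds_b = make_folds(labels, n_splits=5, seed=42)

# True when both runs produced identical train/test indices
same_folds = folds_a == folds_b
```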

Upvotes: 2
