N8_Coder

Reputation: 803

How to fix folds in sklearn?

I am applying cross-validation in several prediction tasks and would like to use the same folds for every parameter set, and if possible across different Python scripts as well, since the performance depends heavily on the folds. I am working with sklearn's KFold:

kf = KFold(n_splits=folds, shuffle=False, random_state=1986)

and build my folds by

for idx_split, (train_index, test_index) in enumerate(kf.split(X, Y)):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = Y[train_index], Y[test_index]

and loop over them like

for idx_alpha, alpha in enumerate([0, 0.2, 0.4, 0.6, 0.8, 1]):
    # [...]
    for idx_split, (train_index, test_index) in enumerate(kf.split(X, Y)):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = Y[train_index], Y[test_index]

Although I set a random_state and a NumPy seed, the folds are not always the same. What can I do to make them identical, and ideally share them across several Python scripts?

Upvotes: 1

Views: 267

Answers (1)

MaxU - stand with Ukraine

Reputation: 210972

You seem to be reinventing GridSearchCV ;-)

Try this approach:

from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# fix random_state so the hold-out split is the same on every run
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1986)

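# a bare estimator takes plain parameter names; the "model__" prefix is only needed inside a Pipeline step named "model"
param_grid = dict(alpha=[0, 0.2, 0.4, 0.6, 0.8, 1])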

model = Lasso()  # put the algorithm you want to use here

folds = 3
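# alternatively, you can prepare the folds yourself;
# random_state only has an effect with shuffle=True (recent scikit-learn versions raise an error otherwise)
#folds = KFold(n_splits=folds, shuffle=True, random_state=1986)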
grid_search = GridSearchCV(model, param_grid=param_grid, cv=folds, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

y_pred = grid_search.best_estimator_.predict(X_test)
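
If the folds also have to be shared between scripts, one option is to compute them once, store one fold id per sample, and rebuild the index pairs wherever they are needed. The sketch below is only an illustration under some assumptions: the data is a random toy set, the file name fold_ids.npy is hypothetical, and alpha=0 is left out of the grid to keep the toy run free of warnings. It relies on the cv argument of GridSearchCV accepting an iterable of (train, test) index arrays.

```python
import os
import tempfile

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV, KFold

# Toy data, purely illustrative (not from the question).
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.rand(60)

# shuffle=True is needed for random_state to have any effect.
kf = KFold(n_splits=3, shuffle=True, random_state=1986)

# Materialize the splits once so every parameter set sees identical folds.
splits = list(kf.split(X))

# Store the partition compactly: one fold id per sample.
fold_ids = np.empty(len(X), dtype=int)
for fold, (_, test_index) in enumerate(splits):
    fold_ids[test_index] = fold

# Hypothetical location; any path visible to all scripts would do.
path = os.path.join(tempfile.gettempdir(), "fold_ids.npy")
np.save(path, fold_ids)

# In another script: reload the ids and rebuild the (train, test) pairs.
loaded = np.load(path)
shared_splits = [
    (np.where(loaded != fold)[0], np.where(loaded == fold)[0])
    for fold in np.unique(loaded)
]

# Pass the precomputed splits directly as cv.
grid_search = GridSearchCV(
    Lasso(),
    param_grid={"alpha": [0.2, 0.4, 0.6, 0.8, 1]},
    cv=shared_splits,
)
grid_search.fit(X, y)
```

Saving the fold ids rather than relying on the seed alone means the folds stay the same even if the scripts run under different scikit-learn or NumPy versions, as long as the rows of X keep the same order.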

Upvotes: 2
