Margaret Stark

Reputation: 1

How to split train data and validation data properly in K fold cross validation

I am not a native English speaker and am using a translator, so I apologize if any sentence is awkward or hard to read.

I am trying to train a model with K-fold cross-validation, but I keep getting errors when splitting the training data into folds. This is how I set up my data set:

df_test = df_data.iloc[50001:, :] #Test set
df_use = df_data.iloc[0:50000, :] #Training set
    
x_test = df_test.drop(['upgraded'], axis = 1)
y_test = df_test['upgraded']
    
x = df_use.drop(['upgraded'], axis = 1)
y = df_use['upgraded']

Every time I try to split this into training and validation data, an error occurs.

for train_ix, val_ix in kfold.split(x):

    trainX, trainy = x[train_ix], y[train_ix]
    valX, valy = x[val_ix], y[val_ix]


    model, val_acc = evaluate_model(trainX, trainy, valX, valy)

I'm not sure this will help, but the line trainX, trainy = x[train_ix], y[train_ix] raises this error:

KeyError: "None of [Int64Index([10000, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008,\n 10009,\n ...\n 49990, 49991, 49992, 49993, 49994, 49995, 49996, 49997, 49998,\n 49999],\n dtype='int64', length=40000)] are in the [columns]"

So I changed the code to this:

for train_ix, val_ix in kfold.split(x):

    trainX, valX = x.iloc[train_ix], x.iloc[val_ix]
    trainy, valy = y.iloc[train_ix], y.iloc[val_ix]

    model, val_acc = evaluate_model(trainX, trainy, valX, valy)

This time the line model, val_acc = evaluate_model(trainX, trainy, valX, valy) raises the error:

IndexError: index -9223372036854775808 is out of bounds for axis 1 with size 2

I also tried the following, after splitting df_use with train_test_split. The same IndexError occurs.

inputs = np.concatenate((x_train, x_val), axis=0)
targets = np.concatenate((y_train, y_val), axis=0)

I want to split and pass the data correctly so that the model can run under K-fold cross-validation. Any help would be appreciated.

Upvotes: 0

Views: 2501

Answers (1)

Kedar U Shet

Reputation: 583

You can try the following, which selects the rows of each fold by position with take:

from sklearn.model_selection import KFold

df_test = df_data.iloc[50000:, :] #Test set (starts at row 50000 so that no row is skipped)
df_use = df_data.iloc[0:50000, :] #Training set (rows 0 to 49999)
    
y_test = df_test['upgraded']
x_test = df_test.drop(['upgraded'], axis = 1)
    
y = df_use['upgraded']
x = df_use.drop(['upgraded'], axis = 1)

kf = KFold(n_splits=2)

for train_index, test_index in kf.split(x):
    # take() selects rows by position, like .iloc
    trainX, valX = x.take(list(train_index),axis=0), x.take(list(test_index),axis=0)
    trainy, valy = y.take(list(train_index),axis=0), y.take(list(test_index),axis=0)
    # evaluate inside the loop so that every fold is used
    model, val_acc = evaluate_model(trainX, trainy, valX, valy)
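
To see why x[train_ix] in the question raises a KeyError while positional selection works, here is a small sketch. The DataFrame and its feature_a and feature_b columns are made up for illustration, since the real data is not shown. Indexing a DataFrame with square brackets and an integer array looks the integers up as column labels, whereas iloc and take pick rows by position, which is what the indices returned by KFold.split are.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Synthetic stand-in for df_use (assumption: the real data is not shown).
rng = np.random.default_rng(0)
df_use = pd.DataFrame({
    "feature_a": rng.normal(size=10),
    "feature_b": rng.normal(size=10),
    "upgraded": rng.integers(0, 2, size=10),
})
x = df_use.drop(columns=["upgraded"])
y = df_use["upgraded"]

kf = KFold(n_splits=5)
train_ix, val_ix = next(kf.split(x))  # positional indices of the first fold

# x[train_ix] treats the integers as column labels, hence the KeyError.
try:
    x[train_ix]
    raised_key_error = False
except KeyError:
    raised_key_error = True

# .iloc and .take both select rows by position and give the same result.
trainX_iloc = x.iloc[train_ix]
trainX_take = x.take(train_ix, axis=0)
same_rows = trainX_iloc.equals(trainX_take)

print(raised_key_error, same_rows)
```

Since iloc and take return the same rows here, the second attempt in the question already splits the data correctly, and the remaining error most likely comes from somewhere else.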

I hope this works. Please comment below if you run into any issues.
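
Regarding the IndexError from the question: the index in the message, -9223372036854775808, is the smallest int64 value. That value often appears when a missing value (NaN) ends up being converted to an integer, but this is only a guess because evaluate_model is not shown. A quick check, sketched here on made-up data with one missing label, is to count the missing values and look at the distinct label values before splitting.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in with one missing label (assumption: the real data is not shown).
df_use = pd.DataFrame({
    "feature_a": [0.1, 0.2, 0.3, 0.4],
    "upgraded": [0.0, 1.0, np.nan, 1.0],
})
x = df_use.drop(columns=["upgraded"])
y = df_use["upgraded"]

# The index in the error message equals the smallest int64 value.
int64_min = np.iinfo(np.int64).min

# Count missing values and list the distinct labels that are present.
n_missing_labels = int(y.isna().sum())
n_missing_features = int(x.isna().sum().sum())
label_values = sorted(y.dropna().unique().tolist())
print(int64_min, n_missing_labels, n_missing_features, label_values)

# Drop rows with a missing label, keeping x and y aligned.
keep = y.notna()
x_clean, y_clean = x[keep], y[keep].astype(int)
```

If the counts are zero on the real data, the cause is elsewhere, for example in how evaluate_model encodes the labels.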

Upvotes: 2
