Reputation: 309
I am trying to implement my own cross-validation function. I read about cross-validation on this link and was able to split my dataset into training and test sets. However, how can I define the folds? For example, my data frame looks like this:
Dataframe:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
I want output like this:
For K=3 (Folds)
When K=1
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
Test:
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
When K=2
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
Test:
3 143.90 81.41 80.94 -999.00 1
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
When K=3
Training:
MMC MET_lep MASS_Vis Pt_H Y
0 138.70 51.65 97.82 0.91 0
1 160.93 68.78 103.23 -999.00 0
2 -999.00 162.17 125.95 -999.00 0
3 143.90 81.41 80.94 -999.00 1
7 175.86 16.91 134.80 -999.00 0
8 -999.00 162.17 125.95 -999.00 0
9 143.90 81.41 80.94 -999.00 1
Test:
4 175.86 16.91 134.80 -999.00 0
5 -999.00 162.17 125.95 -999.00 0
6 143.90 81.41 80.94 -999.00 1
Below is my code. It splits the data but does not create the folds:
import math

split = math.floor(dataset.shape[0] * 0.8)  # first 80% of the rows for training
data_train = dataset[:split]
data_test = dataset[split:]
Thank you in advance for helping on this.
Upvotes: 0
Views: 15585
Reputation: 31
This solution is based on the pandas and NumPy libraries:
import pandas as pd
import numpy as np
First you split your dataset into k parts:
k = 10
folds = np.array_split(data, k)
Then you iterate over the folds, using one as the test set and the other k-1 as the training set, so that the fitting is performed k times in total:
for i in range(k):
    train = folds.copy()  # work on a copy of the list of folds
    test = folds[i]
    del train[i]
    train = pd.concat(train, sort=False)
    perform(clf, train.copy(), test.copy())  # do the fitting; copy because perform() pops the label column
In this function you remove the label column from both sets, fit the scikit-learn classifier (clf) on the training set, and return its score on the test set:
def perform(clf, train_set, test_set):
    # remove labels from data
    train_labels = train_set.pop('Y').values
    test_labels = test_set.pop('Y').values
    clf.fit(train_set, train_labels)
    return clf.score(test_set, test_labels)
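The same splitting can also be expressed with row positions rather than DataFrame pieces. The following is only a minimal sketch, assuming contiguous folds in the original row order; `kfold_indices` is a hypothetical helper name, not part of pandas or scikit-learn. It sizes the folds the same way np.array_split does (the first n % k folds get one extra row).

```python
def kfold_indices(n, k):
    """Yield (train_idx, test_idx) lists for k contiguous folds over range(n)."""
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional row.
        size = base + (1 if i < extra else 0)
        stop = start + size
        test_idx = list(range(start, stop))
        # Everything outside [start, stop) belongs to the training set.
        train_idx = list(range(0, start)) + list(range(stop, n))
        yield train_idx, test_idx
        start = stop


if __name__ == "__main__":
    for train_idx, test_idx in kfold_indices(10, 3):
        print("train:", train_idx, "test:", test_idx)
```

With a DataFrame, `data.iloc[train_idx]` and `data.iloc[test_idx]` would select the corresponding rows, so no copies of the folds are needed until a column is modified.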
Upvotes: 3
Reputation: 340
Is it your intention for the K=2 fold to overlap with the K=3 test fold (3,4,5) vs (4,5,6)? Also, it seems like K is being overloaded in your example to mean both the number of folds, and the index of the current fold. In my answer, I'll use i for the i-th fold out of k total folds.
Assuming the goal is to create non-overlapping folds, it should be sufficient to have a function that cuts the index range 0 to len(dataset) - 1 into k roughly even ranges. You can get a roughly even split, even when the length n is not divisible by k, by placing the boundaries at floor((n*i)/k). In Python you could use a function like this:
def fold_i_of_k(dataset, i, k):
    n = len(dataset)
    return dataset[n*(i-1)//k:n*i//k]
Here is an example on a one-dimensional dataset (row slicing should work just as well for a DataFrame):
>>> fold_i_of_k(list(range(0,11)),1,3)
[0, 1, 2]
>>> fold_i_of_k(list(range(0,11)),2,3)
[3, 4, 5, 6]
>>> fold_i_of_k(list(range(0,11)),3,3)
[7, 8, 9, 10]
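To get the training rows as well, the same boundaries can be used to take everything outside the test range. This is a rough sketch for plain lists; `train_test_for_fold` is a hypothetical name introduced here for illustration. For a DataFrame, the `+` concatenation would need to be replaced with something like pd.concat([dataset[:start], dataset[stop:]]), since `+` on DataFrames adds element-wise.

```python
def train_test_for_fold(dataset, i, k):
    """Return (train, test) for the i-th of k folds, with i counted from 1."""
    n = len(dataset)
    # Same boundaries as fold_i_of_k: floor(n*(i-1)/k) and floor(n*i/k).
    start, stop = n * (i - 1) // k, n * i // k
    test = dataset[start:stop]
    # List concatenation of the rows before and after the test range.
    train = dataset[:start] + dataset[stop:]
    return train, test


if __name__ == "__main__":
    for i in (1, 2, 3):
        train, test = train_test_for_fold(list(range(10)), i, 3)
        print(i, "train:", train, "test:", test)
```

On ten rows with k=3, fold 2 puts rows 3, 4 and 5 in the test set and the remaining seven rows in the training set, which matches the second split in the question.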
Upvotes: 7