Jeffrey

Reputation: 41

K fold cross validation in R

As far as I know, k-fold cross validation partitions the training dataset into k equal subsets, with every subset different. The R code for k-fold validation below is from R-bloggers. The data have 506 obs. and 14 variables, and the code uses 10 folds. My question is whether each fold gets a different subset, or whether some data points can be repeated across folds. I want every data point to be tested exactly once, so my goal is for each fold to contain different data points; a quick check of what I mean is sketched after the code.

set.seed(450)
cv.error <- NULL
k <- 10

library(plyr)
library(neuralnet)
pbar <- create_progress_bar('text')
pbar$init(k)

## (data, scaled and f are defined earlier in the R-bloggers post)
for(i in 1:k){
    ## draw a fresh random 90/10 split on every iteration
    index <- sample(1:nrow(data), round(0.9*nrow(data)))
    train.cv <- scaled[index,]
    test.cv <- scaled[-index,]

    ## fit the network on the training part
    nn <- neuralnet(f, data=train.cv, hidden=c(5,2), linear.output=T)

    ## predict on the held-out part and undo the min-max scaling of medv
    pr.nn <- compute(nn, test.cv[,1:13])
    pr.nn <- pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)

    test.cv.r <- (test.cv$medv)*(max(data$medv)-min(data$medv))+min(data$medv)

    ## mean squared error on the held-out part
    cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)

    pbar$step()
}
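
For example, this is a quick sketch of the check I have in mind: draw two "test sets" the same way the loop does and see whether they share any rows.

set.seed(450)
idx1 <- sample(1:nrow(data), round(0.9*nrow(data)))
idx2 <- sample(1:nrow(data), round(0.9*nrow(data)))
test1 <- setdiff(1:nrow(data), idx1)   ## rows held out in iteration 1
test2 <- setdiff(1:nrow(data), idx2)   ## rows held out in iteration 2
length(intersect(test1, test2))        ## > 0 means some rows would be tested twice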

Upvotes: 0

Views: 3412

Answers (2)

alan ocallaghan
alan ocallaghan

Reputation: 3038

That is not K-fold cross validation: on each iteration a new random sample is drawn, rather than assigning the samples to K folds up front and then cycling through them, using each fold as the test set in turn.

set.seed(450)
cv.error <- NULL
k <- 10

library(plyr)
library(neuralnet)
pbar <- create_progress_bar('text')
pbar$init(k)

## Assign samples to K folds initially
index <- sample(letters[seq_len(k)], nrow(data), replace=TRUE)
for(i in seq_len(k)) {
    ## Make all samples assigned the current letter the test set
    test_ind <- index == letters[[i]]
    test.cv <- scaled[test_ind, ]
    ## All other samples are assigned to the training set
    train.cv <- scaled[!test_ind, ]

    ## It is bad practice to use T instead of TRUE, 
    ## since T is not a reserved variable, and can be overwritten
    nn <- neuralnet(f,data=train.cv,hidden=c(5,2),linear.output=TRUE)

    pr.nn <- compute(nn,test.cv[,1:13])
    pr.nn <- pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)

    test.cv.r <- (test.cv$medv) * (max(data$medv) - min(data$medv)) + min(data$medv)

    cv.error[i] <- sum((test.cv.r - pr.nn) ^ 2) / nrow(test.cv)

    pbar$step()
}
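
Note that sampling the fold labels with replace=TRUE keeps the folds disjoint, but their sizes will vary a little from fold to fold. You can inspect the sizes, or shuffle a balanced label vector instead if you want (near-)equal folds:

table(index)   ## how many samples landed in each fold

## alternative: (near-)equal fold sizes
index <- sample(rep(letters[seq_len(k)], length.out = nrow(data)))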

Then, to produce error estimates with less variance, I would repeat this process multiple times and visualise the distribution of cross-validation error across the repeated runs. I think you would be better off using a package which accomplishes tasks like this for you, such as the excellent caret.
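
For instance, here is a minimal sketch with caret, assuming the Boston housing data from MASS (506 obs., medv as the response) and the nnet backend; caret then handles the fold assignment, the repeats and the resampled error estimates for you.

library(caret)
library(MASS)

set.seed(450)
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)
fit <- train(medv ~ ., data = Boston, method = "nnet",
             preProcess = c("center", "scale"),
             trControl = ctrl, linout = TRUE, trace = FALSE)
fit$resample   ## per-fold, per-repeat performance estimates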

Upvotes: 0

BJK

Reputation: 153

You can shuffle the whole set of row indices outside of the loop. The following code might give you an idea of how to solve the problem.

set.seed(450)
cv.error <- NULL
k <- 10

library(plyr)
library(neuralnet)
pbar <- create_progress_bar('text')
pbar$init(k)

total_index <- sample(1:nrow(data), nrow(data))
    ## shuffle the whole index of samples

fold_id <- cut(seq_along(total_index), breaks = k, labels = FALSE)
    ## split the shuffled positions into k roughly equal, non-overlapping folds

for(i in 1:k){
    index <- total_index[fold_id == i]
        ## pick the samples assigned to fold i,
        ## so you avoid picking overlapping data points across validation sets
    train.cv <- scaled[-index,] ## the samples not in the index (training set)
    test.cv <- scaled[index,]   ## the held-out samples for validation

    nn <- neuralnet(f, data=train.cv, hidden=c(5,2), linear.output=TRUE)

    pr.nn <- compute(nn, test.cv[,1:13])
    pr.nn <- pr.nn$net.result*(max(data$medv)-min(data$medv))+min(data$medv)

    test.cv.r <- (test.cv$medv)*(max(data$medv)-min(data$medv))+min(data$medv)

    cv.error[i] <- sum((test.cv.r - pr.nn)^2)/nrow(test.cv)

    pbar$step()
}
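
Once the loop finishes, cv.error holds one mean squared error per fold; a small sketch of how you might summarise them:

mean(cv.error)   ## average test MSE across the k folds
sd(cv.error)     ## how much the fold estimates vary
boxplot(cv.error, main = "CV error across folds")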

Upvotes: 1
