Reputation: 3
I am trying to split a training set into two sets: a training set and a validation set. The split runs, but for some reason 32 rows in the validation set come back filled with NAs. There were no NAs in the original dataset.
This is the code:
set.seed(123)
sample <- sample.int(n = nrow(traindata), size = floor(.2*nrow(traindata)), replace = F)
traindata <- traindata[-sample, ] #creating training set
validatiedata <- traindata[sample, ] #creating validation set
print(traindata)
head(traindata)
tail(traindata)
print(validatiedata)
head(validatiedata)
tail(validatiedata)
I have tried using different code to split the data:
library(caTools)
set.seed(123)
split = sample.split(traindata, SplitRatio = 0.8)
# Create training and testing sets
train = subset(traindata, split == TRUE)
test = subset(traindata, split == FALSE)
dim(train); dim(test)
head(traindata)
tail(traindata)
head(validatiedata)
tail(validatiedata)
This second code is no good either. It splits the data wrong and also creates the NA's in the validation set.
Any suggestions?
Upvotes: 0
Views: 270
Reputation: 974
You create the data frames traindata and validatiedata in the wrong order:
traindata <- traindata[-sample, ] # Removes rows from traindata
validatiedata <- traindata[sample, ] # Indexes the already-shrunk traindata; indices past its new last row no longer exist, resulting in NA rows
If you swap the order, so that the validation rows are extracted before they are removed from traindata, you won't have this problem:
validatiedata <- traindata[sample, ]
traindata <- traindata[-sample, ]
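To see why only some validation rows turn into NA, here is a small reproducible sketch. The data frame full and the fixed index vector idx are made up for illustration and stand in for your traindata and sample. After the removal the frame is shorter, so any sampled index larger than the new row count returns an NA row, while the smaller indices silently pick rows that are still in the training set.

```r
# Hypothetical toy data standing in for the original traindata
full <- data.frame(id = 1:10, x = seq(0.1, 1, by = 0.1))

# Fixed indices instead of sample.int(), so the result is deterministic
idx <- c(3, 10)

# Wrong order: remove the rows first, then index the shrunken frame
shrunk <- full[-idx, ]        # 8 rows left (ids 1, 2, 4, 5, 6, 7, 8, 9)
wrong_valid <- shrunk[idx, ]  # row 3 of shrunk is id 4; row 10 does not exist
print(wrong_valid)            # second row is all NA

# Right order: extract the validation rows, then remove them
valid <- full[idx, ]
train <- full[-idx, ]
print(valid)
print(anyNA(valid))
print(length(intersect(train$id, valid$id)))  # shared ids between the two sets
```

Note that in the wrong-order version the non-NA validation rows are also rows of the training set, so even the rows that look fine are not a real hold-out sample.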
Upvotes: 1