Melanie

Reputation: 3

Why does my code create NAs when splitting data in R?

I am trying to split a training set into two sets: a training set and a validation set. The split runs, but for some reason 32 rows in the validation set are lost and replaced with NAs. There were no NAs in the original dataset.

This is the code:

set.seed(123)
sample <- sample.int(n = nrow(traindata), size = floor(.2*nrow(traindata)), replace = F)
traindata <- traindata[-sample, ] #creating training set
validatiedata  <- traindata[sample, ] #creating validation set

print(traindata)
head(traindata)
tail(traindata)

print(validatiedata)
head(validatiedata)
tail(validatiedata)

I have tried using different code to split the data:

library(caTools)
set.seed(123)
split = sample.split(traindata, SplitRatio = 0.8)

# Create training and testing sets
train = subset(traindata, split == TRUE)
test = subset(traindata, split == FALSE)

dim(train); dim(test)

head(traindata)
tail(traindata)

head(validatiedata)
tail(validatiedata)

This second attempt does not work either. It splits the data incorrectly, and the validation set still contains NAs.

Any suggestions?

Upvotes: 0

Views: 270

Answers (1)

MånsT

Reputation: 974

You create the data frames traindata and validatiedata in the wrong order:

traindata <- traindata[-sample, ] # Removes rows from traindata
validatiedata  <- traindata[sample, ] # Tries to extract rows that no longer exist, resulting in NAs
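
To see where the NAs come from, here is a minimal sketch on a toy data frame. The names df, idx, df_small and res are made up for illustration and are not from the question:

```r
# Toy data frame standing in for traindata (hypothetical, 10 rows)
df <- data.frame(id = 1:10, x = letters[1:10])

idx <- c(2, 9)           # positions picked for the validation set

df_small <- df[-idx, ]   # drop those rows first: only 8 rows remain
res <- df_small[idx, ]   # then index the shrunken data frame by position

# Position 2 still exists, but now holds a different row than before.
# Position 9 is past the last row, so R returns a row of NAs.
print(res)
```

So the rows that come back are either NA (position beyond the end of the shortened data frame) or rows that actually belong to the training set.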

If you swap the order, you won't have this problem:

validatiedata  <- traindata[sample, ]
traindata <- traindata[-sample, ]
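
As a self-contained sketch of the corrected split, assuming the built-in mtcars data as a stand-in for traindata. The names full, val_idx, training and validation are hypothetical, and the index vector is renamed so that it does not share a name with the base function sample():

```r
set.seed(123)

full <- mtcars   # built-in data set used here in place of traindata

# Draw 20% of the row positions for the validation set
val_idx <- sample.int(n = nrow(full), size = floor(0.2 * nrow(full)), replace = FALSE)

# Extract the validation rows first, then drop them from the training set.
# Writing to new objects instead of overwriting full also makes the order irrelevant.
validation <- full[val_idx, ]
training   <- full[-val_idx, ]

# Sanity checks: every row lands in exactly one set, and no NAs were introduced
stopifnot(nrow(training) + nrow(validation) == nrow(full))
stopifnot(!anyNA(validation))
stopifnot(length(intersect(rownames(training), rownames(validation))) == 0)
```

Keeping the original data frame untouched and assigning both subsets to new names is one way to avoid this kind of ordering bug altogether.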

Upvotes: 1
