LRG

Reputation: 41

caret: creating multiple partitions of different sizes for training/testing/validation

I'm trying to take a dataset and partition it into 3 pieces: training: 60%, testing: 20%, and validation: 20%.

part1 <- createDataPartition(fullDataSet$classe, p=0.8, list=FALSE)
validation <- fullDataSet[-part1,]
workingSet <- fullDataSet[part1,]

When I do the same thing to partition again:

inTrain <- createDataPartition(workingSet$classe, p=.75, list=FALSE)

I get the error:

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

Is there a way to either a) create 3 partitions of different sizes or b) do a nested partition like the one I tried? I've also considered c) using sample() instead, but this is for a class in which the instructor only uses createDataPartition, and we have to show our code. Does anyone have any advice?

Upvotes: 4

Views: 4717

Answers (2)

user3491422

Reputation: 1

  #METHOD 1 : EQUAL SPLITS
  # allind <- sample(1:nrow(m.d),nrow(m.d))
  # #split in three parts 
  # trainind <- allind[1:round(length(allind)/3)]
  # valind <- allind[(round(length(allind)/3)+1):round(length(allind)*(2/3))]
  # testind <- allind[round(length(allind)*(2/3)+1):length(allind)]

  set.seed(1234)

  # METHOD 2: 60-30-10 SPLIT
  n <- nrow(m.d)
  allind <- sample(seq_len(n))                    # shuffled row indices
  ntrain <- round(n * 0.6)                        # size of the training block
  nval   <- round(n * 0.3)                        # size of the validation block
  trainind <- allind[1:ntrain]
  valind   <- allind[(ntrain + 1):(ntrain + nval)]
  testind  <- allind[(ntrain + nval + 1):n]       # whatever is left, about 10%
  m.dTRAIN <- m.d[trainind,]
  m.dVAL   <- m.d[valind,]
  m.dTEST  <- m.d[testind,]
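
For the 60/20/20 split the question asks for, the same shuffle-and-slice idea can be wrapped in a small function. This is only a sketch: `split_indices` is a made-up helper name, the proportions are passed in as an argument, and the built-in `iris` data stands in for `m.d`. Note that, unlike createDataPartition, plain sample() does not stratify by class.

```r
# Hypothetical helper: shuffle the row indices once, then cut them into
# consecutive, non-overlapping blocks whose sizes follow `props`.
split_indices <- function(n, props = c(train = 0.6, val = 0.2, test = 0.2)) {
  stopifnot(all(props >= 0), abs(sum(props) - 1) < 1e-8)
  shuffled <- sample(seq_len(n))
  # Cumulative cut points; the last one is forced to n so no row is dropped.
  ends <- round(cumsum(props) * n)
  ends[length(ends)] <- n
  starts <- c(1, head(ends, -1) + 1)
  # seq_len() gives an empty block (not a backwards range) if a share
  # rounds to zero rows.
  idx <- Map(function(s, e) shuffled[s + seq_len(e - s + 1) - 1], starts, ends)
  names(idx) <- names(props)
  idx
}

set.seed(1234)
m.d <- iris  # stand-in data (assumption); replace with your own data frame
idx <- split_indices(nrow(m.d))
m.dTRAIN <- m.d[idx$train, ]
m.dVAL   <- m.d[idx$val, ]
m.dTEST  <- m.d[idx$test, ]
```

Because every block is cut from one shuffled vector, the three sets cannot overlap, and together they cover every row.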

Upvotes: 0

Fabiola Fernández

Reputation: 272

I was wondering the same thing and came up with a not-very-elegant solution that seems to work.

So, in my case I wanted to create a training dataset with 60% of the data and test and validation datasets with 20% each. Here is how I've done it:

set.seed(1234)
inTraining <- createDataPartition(mydata$FLAG, p=0.6, list=FALSE)
training.set <- mydata[inTraining,]
Totalvalidation.set <- mydata[-inTraining,]
# Split the remaining 40% of the data in half: 20% testing and 20% validation
inValidation <- createDataPartition(Totalvalidation.set$FLAG, p=0.5, list=FALSE)
testing.set <- Totalvalidation.set[inValidation,]
validation.set <- Totalvalidation.set[-inValidation,]

It looks like it gives me the right datasets, and I will be testing them today. Hope it works for you, and if somebody else has a more elegant answer, please share! :)
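
In case it helps, here is the same nested approach as a reusable function. Treat it as a rough sketch rather than anything official: `three_way_split` is a name I made up, and the built-in `iris` data with its `Species` column stands in for `mydata` and `FLAG`. The one thing to watch is that the second `p` is a share of the remainder, not of the full data, so 20% of the total becomes 0.2 / 0.4 = 0.5 of what is left. The `is.atomic` check is there only because the error in the question complains about a non-atomic input; I have not confirmed that is the actual cause there.

```r
library(caret)

# Hypothetical helper: nested, stratified train/test/validation split.
three_way_split <- function(data, outcome, p_train = 0.6, p_test = 0.2) {
  y <- data[[outcome]]
  # createDataPartition() expects a plain vector or factor, not a list.
  stopifnot(is.atomic(y), p_train + p_test < 1)
  in_train <- createDataPartition(y, p = p_train, list = FALSE)
  rest <- data[-in_train, ]
  # Share of the remainder that goes to testing, e.g. 0.2 / 0.4 = 0.5.
  in_test <- createDataPartition(rest[[outcome]],
                                 p = p_test / (1 - p_train), list = FALSE)
  list(training   = data[in_train, ],
       testing    = rest[in_test, ],
       validation = rest[-in_test, ])
}

set.seed(1234)
parts <- three_way_split(iris, "Species")
sapply(parts, nrow)  # row counts of the three sets
```

Since createDataPartition samples within each class, the sizes can be off by a row or two from the exact percentages when the class counts do not divide evenly.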

Upvotes: 5
