Splitting a data frame into training and testing sets in R

I have the following data.frame:

> str(customerduration_data)
Classes 'tbl_df', 'tbl' and 'data.frame':   4495 obs. of  4 variables:
 $ monthofgateOUT    : Ord.factor w/ 4 levels "8"<"9"<"10"<"11": 1 1 1 1 1 1 1 1 1 1 ...
 $ dayofgateOUT      : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 5 1 1 1 1 1 2 2 3 ...
 $ timeofgateOUT     : Ord.factor w/ 20 levels "3"<"4"<"5"<"6"<..: 13 4 2 3 3 11 15 10 13 14 ...
 $ durationCUST_hours: num  95.63 5.73 10.73 10.2 14.4 ...

I want to split this data into a training and a test set, using the following command:

install.packages("caTools")
library(caTools)
set.seed(6)
customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)

However, after running the above command the following error occurs:

Error in `$<-.data.frame`(`*tmp*`, spl, value = c(TRUE, FALSE, FALSE,  : 
  replacement has 4 rows, data has 4495

How can I solve this problem?

Upvotes: 2

Answers (4)

Len Greski

Reputation: 10875

Here is an alternative using the caret package and its createDataPartition() function. We'll use the Alzheimer's disease data from the AppliedPredictiveModeling package to illustrate creating training and test data sets.

library(AppliedPredictiveModeling)
library(caret)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)
# count rows in data frame
nrow(adData)
trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
training <- adData[trainIndex,]
testing <- adData[-trainIndex,]
# rows in training data frame
nrow(training)
# rows in testing data frame 
nrow(testing)

...and the output:

> library(AppliedPredictiveModeling)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> data(AlzheimerDisease)
> adData <- data.frame(diagnosis, predictors)
> # count rows in data frame
> nrow(adData)
[1] 333
> trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
> training <- adData[trainIndex,]
> testing <- adData[-trainIndex,]
> # rows in training data frame
> nrow(training)
[1] 251
> # rows in testing data frame 
> nrow(testing)
[1] 82
>

Upvotes: 1

Rui Barradas

Reputation: 76663

Your code tries to create an index column in the original data.frame. If you instead want to split the data frame into two sets, train and test, you can do the following.

library(caTools)

set.seed(6)    # make the results reproducible

inx <- sample.split(seq_len(nrow(customerduration_data)), 0.7)
train <- customerduration_data[inx, ]
test <-  customerduration_data[!inx, ]

This will not create column spl. In order to create it, use the answer by @RalfStubner.

EDIT.

Another way is to use sample with a vector of probabilities.

inx2 <- sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7))
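One difference worth knowing: drawing each row independently with probabilities gives a split that is only approximately 70/30, whereas shuffling a fixed-size logical vector gives an exact count. A minimal base R sketch comparing the two; the names inx_prob, inx_exact and n_train are illustrative, not from the answer:

```r
set.seed(6)
n <- 4495  # number of rows in the question's data frame

# Independent draws: each row is TRUE with probability 0.7
inx_prob <- sample(c(FALSE, TRUE), n, replace = TRUE, prob = c(0.3, 0.7))

# Shuffled fixed-size vector: exactly floor(0.7 * n) TRUE values
n_train <- floor(0.7 * n)
inx_exact <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))

sum(inx_prob)   # close to 3146, but depends on the seed
sum(inx_exact)  # always 3146
```

For a data set of this size the difference is small, so either index works; use the fixed-size version when the training set must have an exact number of rows.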

Testing the three solutions so far, I get the following results.

microbenchmark::microbenchmark(
  base_griffinevo = sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = F),
  base_Rui = sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7)),
  caTools_Ralf = sample.split(seq_len(nrow(customerduration_data)), 0.7)
)
#Unit: microseconds
#            expr     min       lq      mean  median        uq      max neval
# base_griffinevo 177.072 183.7665  219.3547 195.147  239.6660  523.851   100
#        base_Rui  89.708  93.2225  119.4083 119.666  134.5615  253.389   100
#    caTools_Ralf 838.495 861.4235 1103.0870 926.361 1313.1390 3634.478   100

So the simpler, base R way is also the fastest.

Upvotes: 1

rg255

Reputation: 4169

As an alternative, you could use base R, which is faster (about 3.4x according to microbenchmark) and requires no additional packages:

df$spl <- sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = FALSE)

Then split the data frame on that column, with the TRUE rows (70%) as the training set and the FALSE rows (30%) as the test set:

train_data <- df[df$spl, ]
test_data  <- df[!df$spl, ]
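The same idea can be written without hard-coding the row count by sampling row indices directly. This is only a sketch: split_train_test is a hypothetical helper, not a function from base R or any package, and mtcars is used just as a convenient built-in data set.

```r
# Hypothetical helper (not from any package): split a data frame by row
split_train_test <- function(data, ratio = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)  # optional, for reproducibility
  n <- nrow(data)
  # Draw floor(ratio * n) distinct row indices for the training set
  train_rows <- sample(n, size = floor(ratio * n))
  list(train = data[train_rows, , drop = FALSE],
       test  = data[-train_rows, , drop = FALSE])
}

# Usage on a built-in data set with 32 rows
parts <- split_train_test(mtcars, ratio = 0.7, seed = 6)
nrow(parts$train)  # 22
nrow(parts$test)   # 10
```

Returning both pieces in a list keeps the original data frame unchanged, so no spl column is added.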

Upvotes: 2

Ralf Stubner

Reputation: 26843

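The function sample.split expects a vector, not a data frame. A data frame is a list of columns, so passing customerduration_data yields one value per column (4) instead of one per row (4495), which is exactly what the error message reports. Here is a simple way to pass a vector with one element per row: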

library(caTools)
customerduration_data$spl <- sample.split(seq_len(nrow(customerduration_data)), 
                                          SplitRatio = 0.7)
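To see where the "4 rows" in the error comes from, it can help to compare length() and nrow() on a small data frame. A minimal sketch in base R, assuming (as the error message suggests) that the split vector is sized by the length of the argument; toy is a made-up stand-in for customerduration_data:

```r
# Hypothetical stand-in for customerduration_data: 10 rows, 4 columns
toy <- data.frame(
  month    = rep(8:9, 5),
  day      = rep(c("Monday", "Tuesday"), 5),
  hour     = 3:12,
  duration = seq(1, 10)
)

length(toy)                 # 4: a data frame's length is its column count
nrow(toy)                   # 10: the number of observations
length(seq_len(nrow(toy)))  # 10: one element per row, suitable for splitting
```

Any vector with one element per row works here; if the split should preserve the distribution of an outcome, passing that column (for example customerduration_data$durationCUST_hours) is another option.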

Upvotes: 1
