Reputation: 45
I have the following data.frame:
>str(customerduration_data)
Classes 'tbl_df', 'tbl' and 'data.frame': 4495 obs. of 4 variables:
$ monthofgateOUT : Ord.factor w/ 4 levels "8"<"9"<"10"<"11": 1 1 1 1 1 1 1 1 1 1 ...
$ dayofgateOUT : Ord.factor w/ 7 levels "Monday"<"Tuesday"<..: 4 5 1 1 1 1 1 2 2 3 ...
$ timeofgateOUT : Ord.factor w/ 20 levels "3"<"4"<"5"<"6"<..: 13 4 2 3 3 11 15 10 13 14 ...
$ durationCUST_hours: num 95.63 5.73 10.73 10.2 14.4 ...
I want to split this data into a training and a test set, using the following command:
install.packages("caTools")
library (caTools)
set.seed(6)
customerduration_data$spl=sample.split(customerduration_data,SplitRatio=0.7)
However, after running the above command the following error occurs:
>Error in `$<-.data.frame`(`*tmp*`, spl, value = c(TRUE, FALSE, FALSE, :
replacement has 4 rows, data has 4495
How can I solve this problem?
Upvotes: 2
Views: 8679
Reputation: 10875
Here is an alternative using the caret package and its createDataPartition() function. We'll use the Alzheimer's disease data from the AppliedPredictiveModeling package to illustrate creating training and test data sets.
library(AppliedPredictiveModeling)
library(caret)
data(AlzheimerDisease)
adData <- data.frame(diagnosis, predictors)
# count rows in data frame
nrow(adData)
trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
training <- adData[trainIndex,]
testing <- adData[-trainIndex,]
# rows in training data frame
nrow(training)
# rows in testing data frame
nrow(testing)
...and the output:
> library(AppliedPredictiveModeling)
> library(caret)
Loading required package: lattice
Loading required package: ggplot2
> data(AlzheimerDisease)
> adData <- data.frame(diagnosis, predictors)
> # count rows in data frame
> nrow(adData)
[1] 333
> trainIndex <- createDataPartition(diagnosis, p = .75,list=FALSE)
> training <- adData[trainIndex,]
> testing <- adData[-trainIndex,]
> # rows in training data frame
> nrow(training)
[1] 251
> # rows in testing data frame
> nrow(testing)
[1] 82
>
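When given a factor, createDataPartition() samples within each class, so the class proportions are roughly preserved in both sets. The base R sketch below imitates that idea for readers without caret installed. It is not caret's implementation: stratified_index is a hypothetical helper, and the built-in iris data stands in for adData.

```r
# Hypothetical helper: pick a share p of the row indices within each class of y.
stratified_index <- function(y, p = 0.75) {
  by_class <- split(seq_along(y), y)
  picked <- lapply(by_class, function(idx) {
    k <- ceiling(p * length(idx))
    # Index into idx explicitly; sample(idx, k) misbehaves when idx has length 1.
    idx[sample.int(length(idx), k)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(6)  # make the split reproducible
train_idx <- stratified_index(iris$Species, p = 0.75)
training <- iris[train_idx, ]
testing  <- iris[-train_idx, ]

# Class counts in each set
table(training$Species)
table(testing$Species)
```

Because the sampling happens per class, every species contributes the same share of its rows to the training set.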
Upvotes: 1
Reputation: 76663
You are creating an index column in the original data.frame. If you want to split the df into two sets, train and test, you can do the following.
library(caTools)
set.seed(6) # make the results reproducible
inx <- sample.split(seq_len(nrow(customerduration_data)), 0.7)
train <- customerduration_data[inx, ]
test <- customerduration_data[!inx, ]
This will not create column spl. In order to create it, use the answer by @RalfStubner.
EDIT.
Another way is to use sample with a vector of probabilities.
inx2 <- sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7))
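One difference worth keeping in mind: drawing each flag independently with prob gives a split that is only approximately 70/30, whereas shuffling a fixed number of TRUE and FALSE values gives exact counts. A small base R sketch, assuming n = 4495 as in the question (the variable names are illustrative only):

```r
set.seed(6)
n <- 4495

# Independent draws: each row is TRUE with probability 0.7,
# so the number of TRUE values varies from run to run.
flags_prob <- sample(c(FALSE, TRUE), n, replace = TRUE, prob = c(0.3, 0.7))

# Fixed counts: exactly floor(0.7 * n) TRUE values, shuffled.
n_train <- floor(0.7 * n)
flags_exact <- sample(rep(c(TRUE, FALSE), times = c(n_train, n - n_train)))

sum(flags_prob)   # close to 0.7 * n, but not guaranteed to equal n_train
sum(flags_exact)  # always equal to n_train
```

For most modelling work the small deviation is harmless, but the fixed-count version is the safer choice when the set sizes must match exactly.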
Testing the three solutions so far, I get the following results.
microbenchmark::microbenchmark(
  base_griffinevo = sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = FALSE),
  base_Rui = sample(c(FALSE, TRUE), 4495, replace = TRUE, prob = c(0.3, 0.7)),
  caTools_Ralf = sample.split(seq_len(nrow(customerduration_data)), 0.7)
)
# Unit: microseconds
#            expr     min       lq      mean  median        uq      max neval
# base_griffinevo 177.072 183.7665  219.3547 195.147  239.6660  523.851   100
#        base_Rui  89.708  93.2225  119.4083 119.666  134.5615  253.389   100
#    caTools_Ralf 838.495 861.4235 1103.0870 926.361 1313.1390 3634.478   100
So the simpler, base R way is also the fastest.
Upvotes: 1
Reputation: 4169
As an alternative, you could use base R, which is quicker (about 3.4x, according to microbenchmark) and requires no additional packages:
df$spl <- sample(c(rep(TRUE, floor(0.7*4495)), rep(FALSE, 4495-floor(0.7*4495))), replace = FALSE)
The rows flagged TRUE (70%) then form the training set and the remaining rows form the test set:
train_data <- df[df$spl, ]
test_data <- df[!df$spl, ]
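The hard-coded 4495 can be avoided by deriving the row count from the data itself. The following is a minimal sketch under that assumption; split_rows is a hypothetical helper (not part of base R or any package), and the built-in mtcars data is used only so the example runs on its own.

```r
# Hypothetical helper: split any data frame by row into train/test sets.
split_rows <- function(data, ratio = 0.7) {
  n <- nrow(data)
  train_idx <- sample.int(n, size = floor(ratio * n))
  # setdiff() stays correct even if train_idx happens to be empty,
  # where negative indexing with integer(0) would return no rows.
  test_idx <- setdiff(seq_len(n), train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test  = data[test_idx, , drop = FALSE])
}

set.seed(6)  # make the split reproducible
parts <- split_rows(mtcars, ratio = 0.7)
nrow(parts$train)
nrow(parts$test)
```

Returning both pieces in a list keeps the original data frame free of a helper column such as spl.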
Upvotes: 2
Reputation: 26843
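The function sample.split expects a vector. A data.frame is a list of columns, so passing it directly returns one value per column (4 here), which is why the replacement has 4 rows instead of 4495. Here is a simple way to pass a vector instead: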
library(caTools)
customerduration_data$spl <- sample.split(seq_len(nrow(customerduration_data)),
                                          SplitRatio = 0.7)
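To see where the "replacement has 4 rows" message comes from, it may help to check what length() reports for a data frame. The sketch below uses only base R and a hypothetical stand-in, toy, with the same shape as customerduration_data (random values, not the real data). The length-4 logical vector imitates the value shown in the question's error message.

```r
# Hypothetical stand-in for customerduration_data: 4495 rows, 4 columns.
set.seed(6)
n <- 4495
toy <- data.frame(
  monthofgateOUT     = sample(8:11, n, replace = TRUE),
  dayofgateOUT       = sample(1:7, n, replace = TRUE),
  timeofgateOUT      = sample(3:22, n, replace = TRUE),
  durationCUST_hours = runif(n, 0, 100)
)

# A data frame is a list of columns, so its length is the column count.
len_cols <- length(toy)  # number of columns
len_rows <- nrow(toy)    # number of rows

# Assigning a length-4 vector as a new column triggers the same kind of error.
err_msg <- tryCatch({
  toy$spl <- c(TRUE, FALSE, FALSE, TRUE)
  NA_character_
}, error = function(e) conditionMessage(e))
err_msg

# A vector with one element per row is what the split needs as input.
row_ids <- seq_len(nrow(toy))
length(row_ids)
```

Passing row_ids (or any single column) instead of the whole data frame yields one TRUE/FALSE value per row, which can then be stored as a column.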
Upvotes: 1