Reputation: 2071
I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:
sample(x, size, replace = FALSE, prob = NULL)
I have a dataset that I need to put into a training (75%) and testing (25%) set. I'm not sure what information I'm supposed to put into the x and size? Is x the dataset file, and size how many samples I have?
Upvotes: 207
Views: 681212
Reputation: 18437
There are many ways to partition data. For a more complete approach, take a look at the createDataPartition function in the caret package.
Here is a simple example:
data(mtcars)
## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)
train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]
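To tie this back to the question: in sample(x, size), x is the vector to draw from (here the row numbers, not the data frame itself) and size is how many values to draw. A minimal sketch, assuming the built-in mtcars data, that also checks the two pieces are disjoint:

```r
data(mtcars)
set.seed(123)

n <- nrow(mtcars)              # 32 rows in total
smp_size <- floor(0.75 * n)    # 24 rows go to training

# x = the row numbers 1..n, size = how many of them to draw (without replacement)
train_ind <- sample(seq_len(n), size = smp_size)

train <- mtcars[train_ind, ]   # rows that were drawn
test  <- mtcars[-train_ind, ]  # every row that was not drawn

nrow(train)   # 24
nrow(test)    # 8
length(intersect(rownames(train), rownames(test)))  # 0: no row is in both sets
```

Negative indices drop rows, which is why test ends up with exactly the rows that train does not have.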
Upvotes: 320
Reputation: 420
Try using idx <- sample(2, nrow(data), replace = TRUE, prob = c(0.75, 0.25))
and then use the resulting ids to access the split data:
training <- data[idx == 1,]
testing <- data[idx == 2,]
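One caveat worth knowing: this draws a label for every row independently, so the 75/25 proportion is only approximate. A small sketch, assuming the built-in iris data (any data frame would do), with an exact alternative that shuffles a fixed vector of labels:

```r
set.seed(42)
n <- nrow(iris)   # 150 rows

# Independent label per row: the group sizes change from seed to seed
idx <- sample(2, n, replace = TRUE, prob = c(0.75, 0.25))
table(idx)

# Exact alternative: build the labels with fixed counts, then shuffle them
n_train <- floor(0.75 * n)                         # 112
labels <- rep(c(1, 2), times = c(n_train, n - n_train))
idx_exact <- sample(labels)
table(idx_exact)                                   # 112 ones and 38 twos

training <- iris[idx_exact == 1, ]
testing  <- iris[idx_exact == 2, ]
```

For large data sets the difference is usually negligible; it matters more when the data is small.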
Upvotes: 0
Reputation: 60
I wrote a function (my first one, so it might not work well) to make this go faster if I'm working with multiple data tables and don't want to repeat the code.
xtrain <- function(data, proportion, t1, t2){
data <- data %>% rowid_to_column("rowid")
train <- slice_sample(data, prop = proportion)
assign(t1, train, envir = .GlobalEnv)
test <- data %>% anti_join(as.data.frame(train), by = "rowid")
assign(t2, test, envir = .GlobalEnv)
}
xtrain(iris, .80, 'train_set', 'test_set')
You'll need to have dplyr and tibble loaded. This takes a given dataset, the proportion you want to use for sampling, and two object names. The function creates the two tables and then assigns them as objects in your global environment.
Upvotes: 0
Reputation: 37
Create an index column "rowid" and use anti_join to filter out the sampled rows with by = "rowid". You can remove the rowid column with %>% select(-rowid) after the split.
data <- tibble::rowid_to_column(data)
set.seed(11081995)
testdata <- data %>% slice_sample(prop = 0.2)
traindata <- anti_join(data, testdata, by = "rowid")
Upvotes: 1
Reputation: 1960
I prefer using dplyr to mutate the values
set.seed(1)
# runif(n()) draws one value per row; runif(1) would give every row the same value
mutate(x, train = runif(n()) < 0.75)
I can keep using dplyr::filter with helper functions like
data.split <- function(is_train = TRUE) {
  set.seed(1)
  mutate(x, train = runif(n()) < 0.75) %>%
    filter(train == is_train)
}
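The number of random draws matters here: a single runif(1) produces one value that is recycled to every row, so all rows land on the same side of the split, whereas one draw per row (runif(n()) inside mutate) gives a real partition. A base R sketch of the difference, assuming a made-up data frame x:

```r
x <- data.frame(value = 1:1000)   # hypothetical data, for illustration only

set.seed(1)
one_draw <- runif(1) < 0.75       # a single TRUE or FALSE
length(one_draw)                  # 1: every row would get this same value

set.seed(1)
per_row <- runif(nrow(x)) < 0.75  # one independent draw per row
length(per_row)                   # 1000
mean(per_row)                     # roughly 0.75, not exactly

x$train <- per_row
train <- x[x$train, ]
test  <- x[!x$train, ]
```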
Upvotes: 0
Reputation: 1
I think this would solve the problem:
df = read.csv("data.csv")
# Split the dataset into 80-20
# (rows are taken in file order, so shuffle first if the file is sorted)
numberOfRows = nrow(df)
bound = as.integer(numberOfRows * 0.8)
train = df[1:bound, ]                  # keep all columns
test1 = df[(bound+1):numberOfRows, ]
Upvotes: 0
Reputation: 738
We can divide the data in a particular ratio; here it is 80% in the training set and 20% in the test set.
ind <- sample(2, nrow(dataName), replace = T, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]
Upvotes: 3
Reputation: 101
I bumped into this one, it can help too.
set.seed(12)
data = Sonar[sample(nrow(Sonar)),] # reshuffles the rows (Sonar is from the mlbench package)
bound = floor(0.7 * nrow(data))
df_train = data[1:bound,]
df_test = data[(bound+1):nrow(data),]
Upvotes: 1
Reputation: 2323
The scorecard package has a useful function for that, where you can specify the ratio and seed
library(scorecard)
dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)
The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test
Upvotes: 6
Reputation: 138
After looking through all the different methods posted here, I didn't see anyone use TRUE/FALSE to select and unselect data, so I thought I would share a method that uses that technique.
n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))
training = dataset[split, ]
testing = dataset[!split, ]
There are multiple ways of selecting data in R. Most commonly, people use positive/negative indices to select/unselect respectively. However, the same can be achieved by using TRUE/FALSE to select/unselect.
Consider the following example.
# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)
# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5
# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5
# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5
# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5
Upvotes: 10
Reputation: 6160
A briefer and simpler way, using the awesome dplyr library:
library(dplyr)
set.seed(275) #to get repeatable data
data.train <- sample_frac(Default, 0.7)
train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]
Upvotes: 7
Reputation: 783
If you type:
?sample
it will open the help page that explains what the parameters of the sample function mean.
I am not an expert, but here is some code I have:
data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[1]]
train <- rbind(splitdata[[2]],splitdata[[3]],splitdata[[4]])
This will give you 75% train and 25% test.
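For this to be a clean 75/25 split, the test group must not be reused in train, and the grouping vector should be as long as the data. A sketch under those assumptions, using rep(..., length.out = ) so that it also works when the row count is not a multiple of 4 (the 103 rows here are made up):

```r
set.seed(1)
data <- data.frame(matrix(rnorm(412), nrow = 103))   # 103 rows, 4 columns

# One group label (1..4) per row, in random order
group <- sample(rep(1:4, length.out = nrow(data)))
splitdata <- split(data, group)

test  <- splitdata[[1]]                  # about a quarter of the rows
train <- do.call(rbind, splitdata[-1])   # the remaining three groups

nrow(test)    # 26
nrow(train)   # 77
```

The same four groups can be rotated (use group 2 as test, then 3, then 4) if you want a simple 4-fold cross-validation.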
Upvotes: 5
Reputation: 61
set.seed(123)
llwork<-sample(1:nrow(mydata),round(0.75*nrow(mydata),digits=0)) # nrow(), not length(): length() of a data frame is its number of columns
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]
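A note on counting rows: for a data frame, length() returns the number of columns, so the sample has to be sized with nrow(). A quick check, assuming the built-in mtcars data stands in for mydata:

```r
mydata <- mtcars   # stand-in data, for illustration only

length(mydata)     # 11: the number of columns
nrow(mydata)       # 32: the number of rows

set.seed(123)
llwork <- sample(seq_len(nrow(mydata)), round(0.75 * nrow(mydata)))
wmydata <- mydata[llwork, ]    # 24 training rows
tmydata <- mydata[-llwork, ]   # 8 test rows
```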
Upvotes: 0
Reputation: 21
Assuming df is your data frame, and that you want to create 75% train and 25% test
all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]
Then create the train and test data frames:
df_train <- df[train_i,]
df_test <- df[test_i,]
Upvotes: 2
Reputation:
I can suggest using the rsample package:
library(rsample)
# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two separate dataframes
data_train <- training(data_split)
data_test <- testing(data_split)
Upvotes: 17
Reputation: 1000
Beware of sample for splitting if you are looking for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in your data is all the numbers between 1 and 10. If you dropped just one observation, say 4, sampling by location would yield a different result because now 5 to 10 have all moved places.
An alternative method is to use a hash function to map IDs into some pseudo random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.
For example:
require(openssl) # for md5
require(data.table) # for the demo data
set.seed(1) # this won't help `sample`
population <- as.character(1e5:(1e6-1)) # some made up ID names
N <- 1e4 # sample size
sample1 <- data.table(id = sort(sample(population, N))) # randomly sample N ids
sample2 <- sample1[-sample(N, 1)] # randomly drop one observation from sample1
# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))
[1] 9999
# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)
test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)
[1] 5000
nrow(merge(test1, test2))
[1] 2653
# to fix that, we can use a hash function and sample on the first hex digit of the hash
md5_bit_mod <- function(x, m = 2L) {
# Inputs:
# x: a character vector of ids
# m: the modulo divisor (modify for split proportions other than 50:50)
# Output: remainders from dividing the first digit of the md5 hash of x by m
as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}
# hash splitting preserves the similarity, because the assignment of test/train
# is determined by the hash of each obs., and not by its relative location in the data
# which may change
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))
[1] 5057
nrow(test1a)
[1] 5057
The sample size is not exactly 5000 because assignment is probabilistic, but that shouldn't be a problem in large samples thanks to the law of large numbers.
See also: http://blog.richardweiss.org/2016/12/25/hash-splits.html and https://crypto.stackexchange.com/questions/20742/statistical-properties-of-hash-functions-when-calculating-modulo
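For proportions other than 50:50, the same idea works with more buckets: hash each ID into one of m buckets and send some of them to the test set. Below is a base-R-only sketch. Note that toy_hash_bucket is a hypothetical helper with a simple polynomial hash standing in for md5 so the example runs without extra packages; it is not a good hash, and the resulting proportion is only roughly 1/m.

```r
# Hypothetical stand-in for a real hash such as md5: maps each id to 0..(buckets - 1)
toy_hash_bucket <- function(id, buckets = 4L) {
  vapply(id, function(s) {
    h <- 0
    for (code in utf8ToInt(s)) h <- (h * 31 + code) %% 1000003
    as.integer(h %% buckets)
  }, integer(1), USE.NAMES = FALSE)
}

ids <- as.character(1001:1100)        # made-up IDs
reduced_ids <- ids[ids != "1004"]     # the same data with one observation dropped

# Bucket 0 is the test set (about a quarter of the ids), the rest is training
test_full    <- ids[toy_hash_bucket(ids) == 0L]
test_reduced <- reduced_ids[toy_hash_bucket(reduced_ids) == 0L]

# Dropping an observation does not move any other id between train and test
identical(test_reduced, setdiff(test_full, "1004"))   # TRUE
```

The assignment depends only on the ID string, which is what makes it stable when rows are added or removed.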
Upvotes: 1
Reputation: 9
There is a very simple way to select a number of rows using the R index for rows and columns. This lets you CLEANLY split the data set given a number of rows - say the 1st 80% of your data.
In R all rows and columns are indexed so DataSetName[1,1] is the value assigned to the first column and first row of "DataSetName". I can select rows using [x,] and columns using [,x]
For example: If I have a data set conveniently named "data" with 100 rows I can view the first 80 rows using
View(data[1:80,])
In the same way I can select these rows and subset them using:
train = data[1:80,]
test = data[81:100,]
Now I have my data split into two parts without the possibility of resampling. Quick and easy.
Upvotes: -2
Reputation: 1238
My solution shuffles the rows, then takes the first 75% of the rows as train and the remaining rows as test. Super simples!
row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train_count <- floor(row_count*0.75)
train <- orders_pivotted[head(shuffled_rows,train_count),]
test <- orders_pivotted[tail(shuffled_rows,row_count-train_count),]
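A detail about the sizes: if both parts are sized with floor(), rows can go missing whenever the row count is not a multiple of 4, so the test set is best sized as whatever is left over. A small sketch with a made-up count of 10 rows:

```r
set.seed(7)
row_count <- 10                       # made-up size, not a multiple of 4
shuffled_rows <- sample(row_count)

train_count <- floor(row_count * 0.75)   # 7
floor(row_count * 0.25)                  # 2, so 7 + 2 = 9 and one row would be lost

train_rows <- head(shuffled_rows, train_count)
test_rows  <- tail(shuffled_rows, row_count - train_count)   # the remaining 3 rows

length(train_rows) + length(test_rows)   # 10: every row is used exactly once
```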
Upvotes: 4
Reputation: 618
require(caTools)
set.seed(101) # set the seed so the same split is produced every time
split1=sample.split(data$anycol,SplitRatio=2/3)
train=subset(data,split1==TRUE)
test=subset(data,split1==FALSE)
The sample.split() function returns a logical vector, stored here in split1, in which 2/3 of the entries are TRUE and the rest are FALSE. It does not add a column to the data frame. The rows where split1 is TRUE are copied into train and the other rows are copied into test.
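The reason sample.split() takes a column rather than the whole data frame is that it keeps the relative ratios of the labels in that column the same in both sets. The base R sketch below illustrates that idea on the built-in iris data; it is only an illustration of stratified splitting, not caTools's actual implementation.

```r
set.seed(101)
labels <- iris$Species   # 3 classes with 50 rows each
ratio  <- 2/3

# Within each class, mark a fixed share of its rows as TRUE
in_train <- logical(length(labels))
for (rows in split(seq_along(labels), labels)) {
  chosen <- rows[sample.int(length(rows), floor(ratio * length(rows)))]
  in_train[chosen] <- TRUE
}

train <- iris[in_train, ]
test  <- iris[!in_train, ]

table(train$Species)   # 33 rows of each species
table(test$Species)    # 17 rows of each species
```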
Upvotes: 1
Reputation: 181
Use base R. The function runif generates uniformly distributed values between 0 and 1. Whatever cutoff value you choose (train.size in the example below), approximately that proportion of the records will fall below it.
data(mtcars)
set.seed(123)
#desired proportion of records in training set
train.size<-.7
#TRUE/FALSE vector: TRUE where the random value is below the cutoff
train.ind<-runif(nrow(mtcars))<train.size
#train
train.df<-mtcars[train.ind,]
#test
test.df<-mtcars[!train.ind,]
Upvotes: 2
Reputation: 3242
I would use dplyr for this; it makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add one if your data doesn't have it already.
mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test <- dplyr::anti_join(mtcars, train, by = 'id')
Upvotes: 43
Reputation: 142
With the caTools package, the code is as follows:
library(caTools)
data # your data frame
split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)
Upvotes: 2
Reputation: 1751
It can be easily done by:
set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now Selecting 75% of data as sample from total 'n' rows of the data
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
test <- data[-sample, ]
By using caTools package:
require(caTools)
set.seed(101)
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test = subset(data, sample == FALSE)
Upvotes: 123
Reputation: 12913
My solution is basically the same as dickoa's but a little easier to interpret:
data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]
Upvotes: 17
Reputation: 1624
I will split 'a' into train(70%) and test(30%)
a # original data frame
library(dplyr)
train<-sample_frac(a, 0.7)
sid<-as.numeric(rownames(train)) # because rownames() returns character
test<-a[-sid,]
done
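One thing to watch with the rownames() approach: it relies on the row names being the default 1..n. If they are labels, as in the built-in mtcars, as.numeric() turns them into NA and the negative indexing breaks. A sketch of the problem and of one way around it, assuming an explicit id column (row_id is a made-up name):

```r
# Row names of mtcars are car names, so they cannot be converted to numbers
ids <- suppressWarnings(as.numeric(rownames(mtcars)))
all(is.na(ids))   # TRUE

# Safer: carry an explicit id column and split on that
set.seed(1)
a <- mtcars
a$row_id <- seq_len(nrow(a))

train_ids <- sample(a$row_id, floor(0.7 * nrow(a)))   # 22 of the 32 ids
train <- a[a$row_id %in% train_ids, ]
test  <- a[!(a$row_id %in% train_ids), ]

nrow(train)   # 22
nrow(test)    # 10
```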
Upvotes: 22
Reputation: 277
library(caret)
# take the outcome from the same data frame that is indexed below
intrain<-createDataPartition(y=m_train$classe,p=0.7,list=FALSE)
training<-m_train[intrain,]
testing<-m_train[-intrain,]
Upvotes: 24
Reputation: 2672
Below is a function that creates a list of sub-samples of the same size. It is not exactly what you wanted, but it might prove useful for others. In my case, I used it to build multiple classification trees on smaller samples to test for overfitting:
df_split <- function (df, number){
sizedf <- length(df[,1])
bound <- sizedf/number
list <- list()
for (i in 1:number){
list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
}
return(list)
}
Example :
x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,] 1
# [2,] 2
# [3,] 3
# [4,] 4
# [5,] 5
# [6,] 6
# [7,] 7
# [8,] 8
# [9,] 9
#[10,] 10
x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2
# [[2]]
# [1] 3 4
# [[3]]
# [1] 5 6
# [[4]]
# [1] 7 8
# [[5]]
# [1] 9 10
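When the number of rows is not a multiple of the number of pieces, the fractional bound above can leave rows out (with 10 rows and 3 pieces, row 10 is never selected). A possible variant using split(), assuming a data frame input; df_split_even is a hypothetical name:

```r
# Hypothetical variant: consecutive chunks whose sizes differ by at most one row
df_split_even <- function(df, number) {
  # sorted group labels, e.g. 1 1 1 1 2 2 2 3 3 3 for 10 rows and 3 pieces
  group <- sort(rep(seq_len(number), length.out = nrow(df)))
  split(df, group)
}

x <- data.frame(value = 1:10)
parts <- df_split_even(x, 3)

sapply(parts, nrow)   # chunk sizes 4, 3, 3
```

Every row ends up in exactly one chunk, at the cost of the chunks not all being exactly the same size.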
Upvotes: 2
Reputation: 2600
This is almost the same code, but tidier:
bound <- floor((nrow(df)/4)*3) #define % of training and test set
df <- df[sample(nrow(df)), ] #sample rows
df.train <- df[1:bound, ] #get training set
df.test <- df[(bound+1):nrow(df), ] #get test set
Upvotes: 31