Reputation: 2071

How to split data into training/testing sets using sample function

I've just started using R and I'm not sure how to incorporate my dataset with the following sample code:

sample(x, size, replace = FALSE, prob = NULL)

I have a dataset that I need to put into a training (75%) and testing (25%) set. I'm not sure what information I'm supposed to put into the x and size? Is x the dataset file, and size how many samples I have?

Upvotes: 207

Answers (28)

dickoa

Reputation: 18437

There are numerous approaches to achieve data partitioning. For a more complete approach take a look at the createDataPartition function in the caret package.

Here is a simple example:

data(mtcars)

## 75% of the sample size
smp_size <- floor(0.75 * nrow(mtcars))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mtcars)), size = smp_size)

train <- mtcars[train_ind, ]
test <- mtcars[-train_ind, ]

Upvotes: 320

Shreyas Shrawage

Reputation: 420

try using idx <- sample(2, nrow(data), replace = TRUE, prob = c(0.75, 0.25)) and the using the provided ids to access split data training <- data[idx == 1,] testing <- data[idx == 2,]

Upvotes: 0

Jeffrey Harding

Reputation: 60

I wrote a function (my first one, so it might not work well) to make this go faster if I'm working with multiple data tables and don't want to repeat the code.

xtrain <- function(data, proportion, t1, t2){
  data <- data %>% rowid_to_column("rowid")
  train <- slice_sample(data, prop = proportion)
  assign(t1, train, envir = .GlobalEnv)
  test <- data %>% anti_join(as.data.frame(train), by = "rowid")
  assign(t2, test, envir = .GlobalEnv)
}

xtrain(iris, .80, 'train_set', 'test_set')

You'll need to have dplyr and tibble loaded. This takes a given dataset, the proportion you want use for sampling, and two object names. The function creates the table and then assigns them as an object in your global environment.

Upvotes: 0

Trace Ashley

Reputation: 37

Create an index row "rowid" and use anti join to filter out using by = "rowid". You can remove the rowid column by using %>% select(-rowid) after the split.

data <- tibble::rowid_to_column(data)

set.seed(11081995)

testdata <- data %>% slice_sample(prop = 0.2)

traindata <- anti_join(data, testdata, by = "rowid")

Upvotes: 1

vladiim

Reputation: 1960

I prefer using dplyr to mutate the values

set.seed(1)
mutate(x, train = runif(1) < 0.75)

I can keep using dplyr::filter with helper functions like

data.split <- function(is_train = TRUE) {
    set.seed(1)
    mutate(x, train = runif(1) < 0.75) %>%
    filter(train == is_train)
}

Upvotes: 0

Soham Mukherjee

Reputation: 1

I think this would solve the problem:

df = data.frame(read.csv("data.csv"))
# Split the dataset into 80-20
numberOfRows = nrow(df)
bound = as.integer(numberOfRows *0.8)
train=df[1:bound ,2]
test1= df[(bound+1):numberOfRows ,2]

Upvotes: 0

Adarsh Pawar

Reputation: 746

We can divide data into a particular ratio here it is 80% train and 20% in a test dataset.

ind <- sample(2, nrow(dataName), replace = T, prob = c(0.8,0.2))
train <- dataName[ind==1, ]
test <- dataName[ind==2, ]

Upvotes: 3

user322203

Reputation: 101

I bumped into this one, it can help too.

set.seed(12)
data = Sonar[sample(nrow(Sonar)),]#reshufles the data
bound = floor(0.7 * nrow(data))
df_train = data[1:bound,]
df_test = data[(bound+1):nrow(data),]

Upvotes: 1

camnesia

Reputation: 2333

scorecard package has a useful function for that, where you can specify the ratio and seed

library(scorecard)

dt_list <- split_df(mtcars, ratio = 0.75, seed = 66)

The test and train data are stored in a list and can be accessed by calling dt_list$train and dt_list$test

Upvotes: 6

Joe

Reputation: 138

After looking through all the different methods posted here, I didn't see anyone utilize TRUE/FALSE to select and unselect data. So I thought I would share a method utilizing that technique.

n = nrow(dataset)
split = sample(c(TRUE, FALSE), n, replace=TRUE, prob=c(0.75, 0.25))

training = dataset[split, ]
testing = dataset[!split, ]

Explanation

There are multiple ways of selecting data from R, most commonly people use positive/negative indices to select/unselect respectively. However, the same functionalities can be achieved by using TRUE/FALSE to select/unselect.

Consider the following example.

# let's explore ways to select every other element
data = c(1, 2, 3, 4, 5)


# using positive indices to select wanted elements
data[c(1, 3, 5)]
[1] 1 3 5

# using negative indices to remove unwanted elements
data[c(-2, -4)]
[1] 1 3 5

# using booleans to select wanted elements
data[c(TRUE, FALSE, TRUE, FALSE, TRUE)]
[1] 1 3 5

# R recycles the TRUE/FALSE vector if it is not the correct dimension
data[c(TRUE, FALSE)]
[1] 1 3 5

Upvotes: 10

Shawn Azdam

Reputation: 6200

Just a more brief and simple way using awesome dplyr library:

library(dplyr)
set.seed(275) #to get repeatable data

data.train <- sample_frac(Default, 0.7)

train_index <- as.numeric(rownames(data.train))
data.test <- Default[-train_index, ]

Upvotes: 7

user2502836

Reputation: 793

If you type:

?sample

If will launch a help menu to explain what the parameters of the sample function mean.

I am not an expert, but here is some code I have:

data <- data.frame(matrix(rnorm(400), nrow=100))
splitdata <- split(data[1:nrow(data),],sample(rep(1:4,as.integer(nrow(data)/4))))
test <- splitdata[[1]]
train <- rbind(splitdata[[1]],splitdata[[2]],splitdata[[3]])

This will give you 75% train and 25% test.

Upvotes: 5

Xavier Jiménez Albán

Reputation: 61

set.seed(123)
llwork<-sample(1:length(mydata),round(0.75*length(mydata),digits=0)) 
wmydata<-mydata[llwork, ]
tmydata<-mydata[-llwork, ]

Upvotes: 0

Corentin

Reputation: 21

Assuming df is your data frame, and that you want to create 75% train and 25% test

all <- 1:nrow(df)
train_i <- sort(sample(all, round(nrow(df)*0.75,digits = 0),replace=FALSE))
test_i <- all[-train_i]

Then to create a train and test data frames

df_train <- df[train_i,]
df_test <- df[test_i,]

Upvotes: 2

user9992957

Reputation:

I can suggest using the rsample package:

# choosing 75% of the data to be the training data
data_split <- initial_split(data, prop = .75)
# extracting training data and test data as two seperate dataframes
data_train <- training(data_split)
data_test  <- testing(data_split)

Upvotes: 17

dzeltzer

Reputation: 1000

Beware of sample for splitting if you look for reproducible results. If your data changes even slightly, the split will vary even if you use set.seed. For example, imagine the sorted list of IDs in you data is all the numbers between 1 and 10. If you just dropped one observation, say 4, sampling by location would yield a different results because now 5 to 10 all moved places.

An alternative method is to use a hash function to map IDs into some pseudo random numbers and then sample on the mod of these numbers. This sample is more stable because assignment is now determined by the hash of each observation, and not by its relative position.

For example:

require(openssl)  # for md5
require(data.table)  # for the demo data

set.seed(1)  # this won't help `sample`

population <- as.character(1e5:(1e6-1))  # some made up ID names

N <- 1e4  # sample size

sample1 <- data.table(id = sort(sample(population, N)))  # randomly sample N ids
sample2 <- sample1[-sample(N, 1)]  # randomly drop one observation from sample1

# samples are all but identical
sample1
sample2
nrow(merge(sample1, sample2))

[1] 9999

# row splitting yields very different test sets, even though we've set the seed
test <- sample(N-1, N/2, replace = F)

test1 <- sample1[test, .(id)]
test2 <- sample2[test, .(id)]
nrow(test1)

[1] 5000

nrow(merge(test1, test2))

[1] 2653

# to fix that, we can use some hash function to sample on the last digit

md5_bit_mod <- function(x, m = 2L) {
  # Inputs: 
  #  x: a character vector of ids
  #  m: the modulo divisor (modify for split proportions other than 50:50)
  # Output: remainders from dividing the first digit of the md5 hash of x by m
  as.integer(as.hexmode(substr(openssl::md5(x), 1, 1)) %% m)
}

# hash splitting preserves the similarity, because the assignment of test/train 
# is determined by the hash of each obs., and not by its relative location in the data
# which may change 
test1a <- sample1[md5_bit_mod(id) == 0L, .(id)]
test2a <- sample2[md5_bit_mod(id) == 0L, .(id)]
nrow(merge(test1a, test2a))

[1] 5057

nrow(test1a)

[1] 5057

sample size is not exactly 5000 because assignment is probabilistic, but it shouldn't be a problem in large samples thanks to the law of large numbers.

Upvotes: 1

Dan Butorovich

Reputation: 9

There is a very simple way to select a number of rows using the R index for rows and columns. This lets you CLEANLY split the data set given a number of rows - say the 1st 80% of your data.

In R all rows and columns are indexed so DataSetName[1,1] is the value assigned to the first column and first row of "DataSetName". I can select rows using [x,] and columns using [,x]

For example: If I have a data set conveniently named "data" with 100 rows I can view the first 80 rows using

View(data[1:80,])

In the same way I can select these rows and subset them using:

train = data[1:80,]

test = data[81:100,]

Now I have my data split into two parts without the possibility of resampling. Quick and easy.

Upvotes: -2

Johnny V

Reputation: 1238

My solution shuffles the rows, then takes the first 75% of the rows as train and the last 25% as test. Super simples!

row_count <- nrow(orders_pivotted)
shuffled_rows <- sample(row_count)
train <- orders_pivotted[head(shuffled_rows,floor(row_count*0.75)),]
test <- orders_pivotted[tail(shuffled_rows,floor(row_count*0.25)),]

Upvotes: 4

Abhishek

Reputation: 618

require(caTools)

set.seed(101)            #This is used to create same samples everytime

split1=sample.split(data$anycol,SplitRatio=2/3)

train=subset(data,split1==TRUE)

test=subset(data,split1==FALSE)

The sample.split() function will add one extra column 'split1' to dataframe and 2/3 of the rows will have this value as TRUE and others as FALSE.Now the rows where split1 is TRUE will be copied into train and other rows will be copied to test dataframe.

Upvotes: 1

Konstantin Mingoulin

Reputation: 181

Use base R. Function runif generates uniformly distributed values from 0 to 1.By varying cutoff value (train.size in example below), you will always have approximately the same percentage of random records below the cutoff value.

data(mtcars)
set.seed(123)

#desired proportion of records in training set
train.size<-.7
#true/false vector of values above/below the cutoff above
train.ind<-runif(nrow(mtcars))<train.size

#train
train.df<-mtcars[train.ind,]


#test
test.df<-mtcars[!train.ind,]

Upvotes: 2

Edwin

Reputation: 3252

I would use dplyr for this, makes it super simple. It does require an id variable in your data set, which is a good idea anyway, not only for creating sets but also for traceability during your project. Add it if doesn't contain already.

mtcars$id <- 1:nrow(mtcars)
train <- mtcars %>% dplyr::sample_frac(.75)
test  <- dplyr::anti_join(mtcars, train, by = 'id')

Upvotes: 44

Yash Sharma

Reputation: 142

Use caTools package in R sample code will be as follows:-

data
split = sample.split(data$DependentcoloumnName, SplitRatio = 0.6)
training_set = subset(data, split == TRUE)
test_set = subset(data, split == FALSE)

Upvotes: 2

TheMI

Reputation: 1761

It can be easily done by:

set.seed(101) # Set Seed so that same sample can be reproduced in future also
# Now Selecting 75% of data as sample from total 'n' rows of the data  
sample <- sample.int(n = nrow(data), size = floor(.75*nrow(data)), replace = F)
train <- data[sample, ]
test  <- data[-sample, ]

By using caTools package:

require(caTools)
set.seed(101) 
sample = sample.split(data$anycolumn, SplitRatio = .75)
train = subset(data, sample == TRUE)
test  = subset(data, sample == FALSE)

Upvotes: 124

Alex

Reputation: 12943

My solution is basically the same as dickoa's but a little easier to interpret:

data(mtcars)
n = nrow(mtcars)
trainIndex = sample(1:n, size = round(0.7*n), replace=FALSE)
train = mtcars[trainIndex ,]
test = mtcars[-trainIndex ,]

Upvotes: 17

hyunwoo jeong

Reputation: 1634

I will split 'a' into train(70%) and test(30%)

    a # original data frame
    library(dplyr)
    train<-sample_frac(a, 0.7)
    sid<-as.numeric(rownames(train)) # because rownames() returns character
    test<-a[-sid,]

done

Upvotes: 22

pradnya chavan

Reputation: 277

library(caret)
intrain<-createDataPartition(y=sub_train$classe,p=0.7,list=FALSE)
training<-m_train[intrain,]
testing<-m_train[-intrain,]

Upvotes: 24

Yohan Obadia

Reputation: 2682

Below a function that create a list of sub-samples of the same size which is not exactly what you wanted but might prove usefull for others. In my case to create multiple classification trees on smaller samples to test overfitting :

df_split <- function (df, number){
  sizedf      <- length(df[,1])
  bound       <- sizedf/number
  list        <- list() 
  for (i in 1:number){
    list[i] <- list(df[((i*bound+1)-bound):(i*bound),])
  }
  return(list)
}

Example :

x <- matrix(c(1:10), ncol=1)
x
# [,1]
# [1,]    1
# [2,]    2
# [3,]    3
# [4,]    4
# [5,]    5
# [6,]    6
# [7,]    7
# [8,]    8
# [9,]    9
#[10,]   10

x.split <- df_split(x,5)
x.split
# [[1]]
# [1] 1 2

# [[2]]
# [1] 3 4

# [[3]]
# [1] 5 6

# [[4]]
# [1] 7 8

# [[5]]
# [1] 9 10

Upvotes: 2

Katerina

Reputation: 2610

This is almost the same code, but in more nice look

bound <- floor((nrow(df)/4)*3)         #define % of training and test set

df <- df[sample(nrow(df)), ]           #sample rows 
df.train <- df[1:bound, ]              #get training set
df.test <- df[(bound+1):nrow(df), ]    #get test set

Upvotes: 31

How to split data into training/testing sets using sample function

Answers (28)

Explanation

Related Questions