Reelina

Reputation: 1

R code to split a data set into equal-sized distinct samples

I am having trouble writing R code to obtain 4 distinct samples of equal size from a data set.

Need your help!

Thanks and Regards, Reelina

Upvotes: 0

Views: 4561

Answers (3)

jamieRowen

Reputation: 1549

What to try here depends on your goal. I am going to assume that, given a data frame, you want to create four subsets of equal size, where each subset is a randomly sampled quarter of the data.

For demo purposes I have used the Seatbelts data included in base R, as its number of rows is a multiple of 4. (Strictly, Seatbelts is a multivariate time series rather than a data frame, but row indexing works the same way.) This solution uses base R functions only. For more involved data frame manipulation I suggest looking at the dplyr package.

# use seat belts data as example as it has nrow(x) %% 4 == 0
data(Seatbelts)
# generate a random sample of numbers 1:4 such that each occurs equally
ind = sample(rep(1:4,each = nrow(Seatbelts)/4))
# you could add that as a column to your data frame allowing the groups to be
# specified in formulae etc
# or if you want the four subsets
lapply(split(1:nrow(Seatbelts),ind), function(i) Seatbelts[i,])
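The "add that as a column" option mentioned in the comments above could look roughly like the sketch below. The names `df`, `grp` and `subsets` are made up for illustration, and the seed is set only so the example is reproducible.

```r
set.seed(42)  # assumption: fixed seed, only for reproducibility

# Seatbelts is a multivariate time series, so convert it to a data frame first
df <- as.data.frame(Seatbelts)

# shuffled labels 1:4, each occurring nrow(df) / 4 times
df$grp <- sample(rep(1:4, each = nrow(df) / 4))

# the grouping column can now be used in formulae, or to split the data
subsets <- split(df, df$grp)  # named list of four data frames
table(df$grp)                 # 48 rows per group
```

Keeping the labels in the data frame means the same grouping can be reused later, for example in `aggregate(DriversKilled ~ grp, data = df, FUN = mean)`.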

If your data is a vector, this is easier:

x = runif(24)
ind = sample(rep(1:4,each = length(x)/4))
split(x,ind)

If you don't want random sampling then just create ind as

ind = rep(1:4,each = length(x)/4)

and split in the same way as before.

Be careful with functions like cut, as they will not necessarily give you 4 subsets of equal size.

table(as.numeric(cut(x,4)))

# 1 2 3 4 
# 7 6 3 8 

This is because cut divides the range of x into equal-width intervals rather than dividing its elements into equal-sized groups.
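If you do want equal-sized groups ordered by value (smallest quarter, next quarter, and so on), one possible sketch is to group on the ranks of x instead of on its range. This assumes length(x) is a multiple of 4; `grp` is just an illustrative name.

```r
set.seed(1)  # assumption: fixed seed, only for reproducibility
x <- runif(24)

# rank() gives positions 1..n; ties.method = "first" keeps the ranks distinct
r <- rank(x, ties.method = "first")

# ranks 1-6 go to group 1, 7-12 to group 2, and so on
grp <- ceiling(r / (length(x) / 4))

table(grp)     # 6 values in every group
split(x, grp)  # the four value-ordered subsets
```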

Upvotes: 2

Dave2e

Reputation: 24069

One can use the cut function:

x<-1:100
cutindex<-cut(x, breaks=4)

To rename the resulting groups, assign to "levels":

levels(cutindex)<-c("A", "B", "C", "D")

Once the data has been cut, I would suggest using the group_by function from the dplyr package for additional analysis.
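For readers without dplyr installed, a per-group summary can also be sketched in base R with table and tapply. This is a stand-in for the group_by approach, not the same code; `group_sizes` and `group_means` are illustrative names.

```r
x <- 1:100
cutindex <- cut(x, breaks = 4)
levels(cutindex) <- c("A", "B", "C", "D")

# group sizes: 25 each here, because 1:100 is evenly spaced over its range
group_sizes <- table(cutindex)

# one summary value per group
group_means <- tapply(x, cutindex, mean)
group_means
```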

Upvotes: 0

Raad

Reputation: 2715

How about this approach?

# Create data for example
x <- data.frame(id = 1:100, y = rnorm(100), z = rnorm(100))

# Returns a list with four equally sized distinct samples of the data
lapply(split(sample(nrow(x)), ceiling((1:nrow(x))/25)), function(i) x[i, ])
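The one-liner above hard-codes a group size of 25. A possible generalisation, for when the number of rows is not a multiple of the number of groups, is sketched below. `split_rows` is a hypothetical helper written for this example, not an existing function; it gives group sizes that differ by at most one.

```r
# Hypothetical helper: split the rows of `df` into `k` random,
# non-overlapping subsets whose sizes differ by at most one.
split_rows <- function(df, k = 4) {
  n <- nrow(df)
  grp <- sample(rep_len(seq_len(k), n))  # shuffled group labels 1..k
  split(df, grp)
}

set.seed(123)  # assumption: fixed seed, only for reproducibility
x <- data.frame(id = 1:102, y = rnorm(102), z = rnorm(102))

parts <- split_rows(x, 4)
sapply(parts, nrow)  # sizes 26, 26, 25, 25
```

Every row lands in exactly one subset, so the samples stay distinct even when the sizes cannot be exactly equal.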

Upvotes: 0
