Reputation: 1
I am having trouble writing R code to draw 4 distinct samples of equal size from a data set.
Need your help!
Thanks and Regards, Reelina
Upvotes: 0
Views: 4561
Reputation: 1549
What to try here depends on your goal. I am going to assume that, given a data frame, you want to create four subsets of equal size where each subset is a randomly sampled quarter of the data.
For demo purposes I have used the Seatbelts data included in base R, as its number of rows (192) is a multiple of 4. This solution uses base R functions only. For more involved data frame manipulation I suggest looking at the dplyr package.
# use seat belts data as example as it has nrow(x) %% 4 == 0
data(Seatbelts)
# generate a random sample of numbers 1:4 such that each occurs equally
ind = sample(rep(1:4,each = nrow(Seatbelts)/4))
# you could add that as a column to your data frame allowing the groups to be
# specified in formulae etc
# or if you want the four subsets
lapply(split(1:nrow(Seatbelts),ind), function(i) Seatbelts[i,])
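The comments above mention adding the group labels as a column. A minimal sketch of that idea follows; the names df, group and subsets are illustrative, and the seed is arbitrary and only there for reproducibility. Seatbelts is converted with as.data.frame first because it is stored as a time series matrix rather than a data frame.

```r
data(Seatbelts)
df <- as.data.frame(Seatbelts)   # time series matrix -> data frame

set.seed(1)                      # arbitrary seed, for reproducibility only
# shuffled labels 1..4, each appearing nrow(df)/4 times
df$group <- sample(rep(1:4, each = nrow(df) / 4))

table(df$group)                  # every group has the same number of rows

# the label column can now be used in a formula ...
aggregate(DriversKilled ~ group, data = df, FUN = mean)

# ... or to split the data frame into a list of four subsets
subsets <- split(df, df$group)
sapply(subsets, nrow)
```

Keeping the labels in the data frame avoids carrying a separate index vector around, at the cost of one extra column.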
If your data is a vector, this is easier:
x = runif(24)
ind = sample(rep(1:4,each = length(x)/4))
split(x,ind)
If you don't want random sampling, just create ind as
ind = rep(1:4,each = length(x)/4)
and split in the same way as before.
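The examples so far assume the length is a multiple of 4. If it is not, one possible variation (an assumption on my part, not something the question asked for) is to build the labels with length.out instead of each, so that group sizes differ by at most one. The name parts is illustrative.

```r
set.seed(1)                      # arbitrary seed, for reproducibility only
x <- runif(26)                   # 26 is not a multiple of 4

# recycle 1:4 up to length(x), then shuffle the labels
ind <- sample(rep(1:4, length.out = length(x)))

parts <- split(x, ind)
lengths(parts)                   # group sizes differ by at most 1
```

Drop the sample() call if you want the groups assigned in a fixed round-robin order instead of at random.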
Be careful with functions like cut, as they will not necessarily give you 4 subsets of equal size.
table(as.numeric(cut(x,4)))
# example output; the counts vary from run to run because x is random
# 1 2 3 4
# 7 6 3 8
This is because cut divides the range of x into equal-width intervals rather than dividing its length into equal-sized groups.
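If you do want groups defined by the values (lowest quarter, next quarter, and so on) while keeping the sizes equal, one option is to bin the ranks instead of the values. This is a sketch under the assumption that length(x) is a multiple of 4; note that it produces ordered groups, not random samples.

```r
set.seed(42)                     # arbitrary seed, for reproducibility only
x <- runif(24)

# equal-width bins on the values: counts depend on where the values fall
table(cut(x, 4, labels = FALSE))

# equal-count bins: ranks run from 1 to length(x), so scaling them to 1..4
# and rounding up puts exactly length(x)/4 elements in each group
r <- rank(x, ties.method = "first")
ind <- ceiling(r * 4 / length(x))

table(ind)
split(x, ind)
```

Using ties.method = "first" gives every element a distinct rank, so tied values cannot make the groups uneven.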
Upvotes: 2
Reputation: 24069
One can use the cut command:
x <- 1:100
cutindex <- cut(x, breaks = 4)
To rename the resulting groups, use the levels function:
levels(cutindex)<-c("A", "B", "C", "D")
Once the data has been cut, I would suggest using the group_by command from the dplyr package for additional analysis.
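As a rough illustration of per-group analysis on the cut result, here is a sketch that uses base R's tapply as a dependency-free stand-in for dplyr's group_by and summarise. The groups come out equal in size here only because x is evenly spaced.

```r
x <- 1:100
cutindex <- cut(x, breaks = 4)
levels(cutindex) <- c("A", "B", "C", "D")

table(cutindex)                  # 25 values per group for this evenly spaced x

# per-group summary: mean of x within each level of cutindex
group_means <- tapply(x, cutindex, mean)
group_means
```

With unevenly distributed data the same code still runs, but the group sizes reported by table() will generally differ.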
Upvotes: 0
Reputation: 2715
How about this approach?
# Create data for example
x <- data.frame(id = 1:100, y = rnorm(100), z = rnorm(100))
# Returns a list with four equally sized distinct samples of the data
lapply(split(sample(nrow(x)), ceiling((1:nrow(x))/25)), function(i) x[i, ])
Upvotes: 0