Reputation: 14192
From these questions ("Random sample of rows from subset of an R dataframe" and "Sample random rows in dataframe") I can easily see how to randomly sample (select) 'n' rows from a df, or 'n' rows that originate from a specific level of a factor within a df.
Here are some sample data:
df <- data.frame(matrix(rnorm(80), nrow=40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each=10)
df[sample(nrow(df), 3), ] #samples 3 random rows from df, without replacement.
To e.g. just sample 3 random rows from the 'pink' color, using library(kimisc):
library(kimisc)
sample.rows(subset(df, color == "pink"), 3)
or by writing a custom function:
sample.df <- function(df, n) df[sample(nrow(df), n), , drop = FALSE]
sample.df(subset(df, color == "pink"), 3)
However, I want to sample 3 (or n) random rows from each level of the factor, i.e. the new df would have 12 rows (3 from blue, 3 from red, 3 from yellow, 3 from pink). It's obviously possible to run this several times, create a new df for each color, and then bind them together, but I am looking for a simpler solution.
Upvotes: 31
Views: 42874
Reputation: 13046
Here's a solution. We split a data.frame into color groups. Then we sample 3 rows from each group. This yields a list of data.frames.
df2 <- lapply(split(df, df$color),
function(subdf) subdf[sample(1:nrow(subdf), 3),]
)
To obtain the desired result, we merge the list of data.frames into 1 data.frame:
do.call('rbind', df2)
## X1 X2 color
## blue.3 -1.22677188 1.25648082 blue
## blue.4 -0.54516686 -1.94342967 blue
## blue.1 0.44647071 0.16283326 blue
## pink.40 0.23520296 -0.40411906 pink
## pink.34 0.02033939 -0.32321309 pink
## pink.33 -1.01790533 -1.22618575 pink
## red.16 1.86545895 1.11691250 red
## red.11 1.35748078 -0.36044728 red
## red.13 -0.02425645 0.85335279 red
## yellow.21 1.96728782 -1.81388110 yellow
## yellow.25 -0.48084967 0.07865186 yellow
## yellow.24 -0.07056236 -0.28514125 yellow
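The two steps above can be wrapped into one helper. This is only a minimal sketch, not part of the original answer: the name sample_by_group and its arguments are hypothetical, and it assumes you want to cap the sample at the group size when a group has fewer than n rows.

```r
# Hypothetical helper (not from the answer above): sample up to n rows per group.
sample_by_group <- function(data, group_col, n) {
  pieces <- lapply(split(data, data[[group_col]]), function(subdf) {
    # min() caps the request when a group has fewer than n rows.
    k <- min(n, nrow(subdf))
    # sample.int() draws row positions; drop = FALSE keeps a data.frame.
    subdf[sample.int(nrow(subdf), k), , drop = FALSE]
  })
  # Bind the per-group samples back into a single data.frame.
  do.call(rbind, pieces)
}

set.seed(1)  # only to make the example reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

res <- sample_by_group(df, "color", 3)
table(res$color)
```

Capping with min() is a design choice; if a short group should be an error instead, drop the min() call and let sample.int() fail.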
Upvotes: 6
Reputation: 385
Here is a way, in base, that allows for multiple groups and sampling with replacement:
n <- 3
resample <- TRUE
index <- 1:nrow(df)
fun <- function(x) sample(x, n, replace = resample)
a <- aggregate(index, by = list(group = df$color), FUN = fun )
df[c(a$x),]
To add another group, include it in the 'by' argument to aggregate.
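As a sketch of what adding a second group looks like, the example below invents a size column purely for illustration (it is not in the original data) and samples without replacement within each color/size combination.

```r
set.seed(42)  # only to make the example reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)
# Hypothetical second grouping column, added only for illustration.
df$size <- rep(c("small", "large"), times = 20)

n <- 3
index <- seq_len(nrow(df))
fun <- function(x) sample(x, n, replace = FALSE)

# Each color/size combination has 5 rows, so n = 3 fits without replacement.
a <- aggregate(index, by = list(color = df$color, size = df$size), FUN = fun)

# a$x is a matrix of row indices (one row per group); c() flattens it.
res2 <- df[c(a$x), ]
table(res2$color, res2$size)
```

One caveat with this pattern: if a group contains a single row, sample(x, n) receives a length-one number and samples from 1:x instead, so groups of size one need separate handling.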
Upvotes: 0
Reputation: 173577
In versions of dplyr 0.3 and later, this works just fine:
df %>% group_by(color) %>% sample_n(size = 3)
dplyr (version <= 0.2)
I set out to answer this using dplyr, assuming that this would work:
df %.% group_by(color) %.% sample_n(size = 3)
But it turns out that in 0.2 the sample_n.grouped_df S3 method exists but isn't registered in the NAMESPACE file, so it's never dispatched. Instead, I had to do this:
df %.% group_by(color) %.% dplyr:::sample_n.grouped_df(size = 3)
Source: local data frame [12 x 3]
Groups: color
X1 X2 color
8 0.66152710 -0.7767473 blue
1 -0.70293752 -0.2372700 blue
2 -0.46691793 -0.4382669 blue
32 -0.47547565 -1.0179842 pink
31 -0.15254540 -0.6149726 pink
39 0.08135292 -0.2141423 pink
15 0.47721644 -1.5033192 red
16 1.26160230 1.1202527 red
12 -2.18431919 0.2370912 red
24 0.10493757 1.4065835 yellow
21 -0.03950873 -1.1582658 yellow
28 -2.15872261 -1.5499822 yellow
Presumably this will be fixed in a future update.
Upvotes: 37
Reputation: 193517
I would consider my stratified function, which is presently hosted as a GitHub Gist.
Get it with:
library(devtools) ## To download "stratified"
source_gist("https://gist.github.com/mrdwab/6424112")
And use it with:
stratified(df, "color", 3)
There are several different features that are convenient for stratified sampling. For instance, you can also restrict the sample to certain groups "on the fly":
stratified(df, "color", 3, select = list(color = c("blue", "red")))
To give you a sense of what the function does, here are the arguments to stratified:
df: The input data.frame.
group: A character vector of the column or columns that make up the "strata".
size: The desired sample size.
    If size is a value less than 1, a proportionate sample is taken from each stratum.
    If size is a single integer of 1 or more, that number of samples is taken from each stratum.
    If size is a vector of integers, the specified number of samples is taken for each stratum. It is recommended that you use a named vector. For example, if you have two strata, "A" and "B", and you wanted 5 samples from "A" and 10 from "B", you would enter size = c(A = 5, B = 10).
select: This allows you to subset the groups in the sampling process. This is a list. For instance, if your group variable was "Group", and it contained three strata, "A", "B", and "C", but you only wanted to sample from "A" and "C", you can use select = list(Group = c("A", "C")).
replace: For sampling with replacement.
Upvotes: 7
Reputation: 206232
You can assign a random ID to each element that has a particular factor level using ave. Then you can select all random IDs in a certain range.
rndid <- with(df, ave(X1, color, FUN=function(x) {sample.int(length(x))}))
df[rndid<=3,]
This has the advantage of preserving the original row order and row names, if that's something you are interested in. Plus, you can re-use the rndid vector to create subsets of different lengths fairly easily.
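To illustrate the re-use point, here is a small sketch that draws a 3-per-color and a 5-per-color sample from the same rndid vector. The object names three_per_color and five_per_color are made up for this example.

```r
set.seed(7)  # only to make the example reproducible
df <- data.frame(matrix(rnorm(80), nrow = 40))
df$color <- rep(c("blue", "red", "yellow", "pink"), each = 10)

# Within each color, assign a random permutation of 1..(group size).
rndid <- with(df, ave(X1, color, FUN = function(x) sample.int(length(x))))

# Rows are filtered, not reordered, so the original row order is kept.
three_per_color <- df[rndid <= 3, ]
five_per_color  <- df[rndid <= 5, ]

table(three_per_color$color)
table(five_per_color$color)
```

Because both subsets come from the same permutation, the smaller sample is nested inside the larger one, which may or may not be what you want.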
Upvotes: 7