Reputation: 103
** edited because I'm a doofus - with replacement, not without **
I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))
head(df)
group_one group_two
1 0 a
2 1 b
3 2 c
4 3 d
5 4 e
6 5 f
...and so on.
What I want is some number (k = 12 at the moment, but that number may vary) of stratified samples, stratified by membership in (group_one x group_two). Membership in each sample should be indicated by a new column, sample_membership, which takes a value from 1 through k (again, 12 at the moment). I should then be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative with respect to group_one and group_two.
Final data set would thus look something like this:
group_one group_two sample_membership
1 0 a 1
2 0 a 12
3 0 a 5
4 1 a 5
5 1 a 7
6 1 a 9
Thoughts? Thanks very much in advance!
Upvotes: 3
Views: 4865
Reputation: 38500
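Here is a base R method. It assumes that your data.frame is sorted by group, in the same order in which aggregate() lists the groups (with these arguments, by group_two and then by group_one within it):
# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x
# for reproducibility, set the seed
set.seed(1234)
# get sample by group; unlist(lapply()) also works when group sizes differ,
# where c(sapply()) would return a list instead of a vector
df$sample_membership <- unlist(lapply(groupCnt, function(i) sample(12, i, replace=TRUE)))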
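If sorting the data first is inconvenient, a possible alternative is ave(), which applies a function within each group and writes the results back in the original row order. This is only a sketch under assumptions: it rebuilds the question's sample data, and k is a name introduced here for the number of samples.

```r
# Rebuild the question's sample data so the sketch is self-contained
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))

k <- 12L        # number of samples (assumed, taken from the question)
set.seed(1234)  # for reproducibility

# ave() splits the row indices by (group_one, group_two), applies FUN to each
# piece, and puts the results back in the original row order, so the data
# does not need to be sorted beforehand
df$sample_membership <- ave(
  seq_len(nrow(df)), df$group_one, df$group_two,
  FUN = function(idx) sample.int(k, length(idx), replace = TRUE)
)

head(df)
```

Because the labels are drawn with replacement, the samples will only be roughly equal in size within each group.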
Upvotes: 2
Reputation: 18995
Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.
library(data.table)
setDT(df)
df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]
df
# group_one group_two sample_membership
# 1: 0 a 9
# 2: 0 a 8
# 3: 0 c 10
# 4: 0 c 4
# 5: 0 e 9
# ---
# 256: 9 v 4
# 257: 9 x 7
# 258: 9 x 11
# 259: 9 z 3
# 260: 9 z 8
For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure you have no more than k members per group, otherwise the call will fail. OR:
If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:
# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)
big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1
Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.
This approach stratifies the sample about as evenly as possible: within each group, the sample sizes differ by at most one. Perfectly even samples are only possible if every group has a multiple of k members.
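One way to check how even the result is: count the rows per group and sample, then look at the spread within each group. The sketch below repeats the two steps on the same example data; counts, spread, min_n, max_n and n_labels are names made up for this illustration, and the seed is arbitrary.

```r
library(data.table)

k <- 12L
set.seed(42)  # arbitrary seed, only for reproducibility
big_df <- data.table(group_one = rep(0:9, 260), group_two = rep(letters, 100))

# same two steps as above: number the rounds, then permute labels within each round
big_df[, sample_round := rep(1:.N, each = k, length.out = .N),
       by = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace = FALSE),
       by = .(group_one, group_two, sample_round)]

# rows per (group, sample), then the smallest and largest sample in each group
counts <- big_df[, .N, by = .(group_one, group_two, sample_membership)]
spread <- counts[, .(min_n = min(N), max_n = max(N), n_labels = .N),
                 by = .(group_one, group_two)]

# largest gap between sample sizes within any single group
spread[, max(max_n - min_n)]
```

Note that a label that never occurs in a group does not show up in counts at all, so for groups with fewer than k rows the n_labels column is the one to look at.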
Upvotes: 4
Reputation: 1579
Maybe something like this?
library(dplyr)
df %>%
  group_by(group_one, group_two) %>%
  # replace = TRUE, because groups can have more than 12 rows
  mutate(sample_membership = sample(1:12, n(), replace = TRUE))
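If you would rather avoid repeated labels wherever a group is small enough, one possible variation is to switch replacement on only for groups with more than k rows. This is a sketch, not a tested recipe for the real data: k and df_sampled are names introduced here, and the data is the question's small sample.

```r
library(dplyr)

k <- 12L
set.seed(1)  # arbitrary seed, only for reproducibility
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))

df_sampled <- df %>%
  group_by(group_one, group_two) %>%
  # sample with replacement only when the group has more rows than labels
  mutate(sample_membership = sample.int(k, n(), replace = n() > k)) %>%
  ungroup()

head(df_sampled)
```

In the sample data every group has two rows, so no label is repeated within a group; larger groups fall back to plain sampling with replacement.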
Upvotes: 8
Reputation: 555
Untested example using dplyr; if it doesn't work, it might at least point you in the right direction.
library(dplyr)
set.seed(123)
df <- data.frame(
  group_one = as.integer(runif(1000, 1, 6)),
  group_two = sample(LETTERS[1:6], 1000, TRUE)
) %>%
  group_by(group_one, group_two) %>%
  mutate(
    # random permutation of 1..n within each group (n = group size),
    # so the values are not limited to 1..12
    sample_membership = sample(seq_len(n()), n(), FALSE)
  )
Good luck!
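A within-group permutation like this can be folded into k labels with a modulo, which might be one way to get samples of nearly equal size. The following is a sketch under assumptions: k and within_group_rank are names introduced here for illustration, and the data is the same simulated example.

```r
library(dplyr)

k <- 12L
set.seed(123)
df <- data.frame(
  group_one = as.integer(runif(1000, 1, 6)),
  group_two = sample(LETTERS[1:6], 1000, TRUE)
) %>%
  group_by(group_one, group_two) %>%
  mutate(
    # random permutation of 1..n within the group
    within_group_rank = sample.int(n()),
    # fold the ranks into labels 1..k; ranks 1, k+1, 2k+1, ... all map to 1
    sample_membership = (within_group_rank - 1L) %% k + 1L
  ) %>%
  ungroup()

# rows per group (rows) and label (columns)
table(interaction(df$group_one, df$group_two), df$sample_membership)
```

Since each run of k consecutive ranks covers every label once, the label counts within a group can differ by at most one.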
Upvotes: 0