seehuus

Reputation: 103

Generate random numbers by group with replacement

** edited because I'm a doofus - with replacement, not without **

I have a large-ish (>500k rows) dataset with 421 groups, defined by two grouping variables. Sample data as follows:

df<-data.frame(group_one=rep((0:9),26), group_two=rep((letters),10))

head(df)

  group_one group_two
1         0         a
2         1         b
3         2         c
4         3         d
5         4         e
6         5         f

...and so on.

What I want is some number of stratified samples (k = 12 at the moment, but that number may vary), stratified by membership in (group_one x group_two). Membership in each sample should be indicated by a new column, sample_membership, which takes a value from 1 through k (again, 12 at the moment). I should be able to subset by sample_membership and get up to 12 distinct samples, each of which is representative with respect to group_one and group_two.

Final data set would thus look something like this:

  group_one group_two sample_membership
1         0         a                 1  
2         0         a                12
3         0         a                 5
4         1         a                 5
5         1         a                 7
6         1         a                 9

Thoughts? Thanks very much in advance!

Upvotes: 3

Views: 4865

Answers (4)

lmo

Reputation: 38500

Here is a base R method that assumes your data.frame is sorted by group, in the same order in which aggregate returns the groups (by group_two, then by group_one within it):

# get number of observations for each group
groupCnt <- with(df, aggregate(group_one, list(group_one, group_two), FUN=length))$x

# for reproducibility, set the seed
set.seed(1234)    
# get sample by group
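# unlist(lapply()) also works when group sizes differ,
# where c(sapply()) would return a list
df$sample <- unlist(lapply(groupCnt, function(i) sample(12, i, replace=TRUE)))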
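
If sorting the data first is inconvenient, one possible alternative is to let base R's ave() do the grouping, since it writes each group's result back to the rows it came from. This is only a sketch, not part of the method above: the column name sample_membership and k = 12 follow the question, and combining the two grouping variables with interaction() is my own assumption.

```r
# Sketch: per-group sampling with replacement that does not rely on row order.
df <- data.frame(group_one = rep(0:9, 26), group_two = rep(letters, 10))
k <- 12L

set.seed(1234)
# one factor level per observed (group_one, group_two) combination
grp <- interaction(df$group_one, df$group_two, drop = TRUE)

# ave() applies FUN to each group's row indices and puts the results back
# in the original row positions
df$sample_membership <- ave(
  seq_len(nrow(df)), grp,
  FUN = function(i) sample.int(k, length(i), replace = TRUE)
)

head(df)
```

Because the labels are written back by position, the rows can stay in whatever order they arrived in.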

Upvotes: 2

C8H10N4O2

Reputation: 18995

Here's a one-line data.table approach, which you should definitely consider if you have a long data.frame.

library(data.table)

setDT(df)

df[, sample_membership := sample.int(12, .N, replace=TRUE), keyby = .(group_one, group_two)]

df
#    group_one group_two sample_membership
#   1:         0         a                 9
#   2:         0         a                 8
#   3:         0         c                10
#   4:         0         c                 4
#   5:         0         e                 9
# ---                                      
# 256:         9         v                 4
# 257:         9         x                 7
# 258:         9         x                11
# 259:         9         z                 3
# 260:         9         z                 8

For sampling without replacement, use replace=FALSE, but as noted elsewhere, make sure you have no more than k members per group. Alternatively:

If you want to use "sampling without unnecessary replacement" (making this up -- not sure what the right terminology is here) because you have more than k members per group but still want to keep the groups as evenly sized as possible, you could do something like:

# example with bigger groups
k <- 12L
big_df <- data.frame(group_one=rep((0:9),260), group_two=rep((letters),100))
setDT(big_df)

big_df[, sample_round := rep(1:.N, each=k, length.out=.N), keyby = .(group_one, group_two)]
big_df[, sample_membership := sample.int(k, .N, replace=FALSE), keyby = .(group_one, group_two, sample_round)]
head(big_df, 15) # you can see first repeat does not occur until row k+1 

Within each "sampling round" (first k observations in the group, second k observations in the group, etc.) there is sampling without replacement. Then, if necessary, the next sampling round makes all k assignments available again.

This approach stratifies the sample about as evenly as possible (perfectly even sample sizes are only possible if every group has a multiple of k members).
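
For readers without data.table, the same "sampling rounds" idea can be sketched in base R. The helper name balanced_labels is made up for illustration, and the final check simply counts how often each label is used per group; treat it as a sketch under those assumptions rather than a drop-in replacement.

```r
# Sketch: "sampling rounds" in base R (balanced_labels is a hypothetical helper).
balanced_labels <- function(n, k) {
  rounds <- ceiling(n / k)
  # one fresh permutation of 1..k per round, concatenated and trimmed to n
  draws <- unlist(lapply(seq_len(rounds), function(r) sample.int(k)))
  draws[seq_len(n)]
}

k <- 12L
big_df <- data.frame(group_one = rep(0:9, 260), group_two = rep(letters, 100))

set.seed(42)
grp <- interaction(big_df$group_one, big_df$group_two, drop = TRUE)
big_df$sample_membership <- NA_integer_
for (rows in split(seq_len(nrow(big_df)), grp)) {
  big_df$sample_membership[rows] <- balanced_labels(length(rows), k)
}

# k x (number of groups) matrix: how often each label 1..k is used per group
counts <- sapply(split(big_df$sample_membership, grp), tabulate, nbins = k)
range(counts)  # within a group, label counts differ by at most 1
```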

Upvotes: 4

Shorpy

Reputation: 1579

Maybe something like this?:

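library(dplyr)

# replace = TRUE because groups may have more than 12 rows;
# replace = FALSE only works when every group has at most 12 rows
df %>% 
  group_by(group_one, group_two) %>% 
  mutate(sample_membership = sample(1:12, n(), replace = TRUE))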
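
Whether replace = FALSE is usable depends on group size, because sample() cannot draw more distinct values than the label pool holds. A minimal base R sketch of that behaviour, assuming k = 12 and two made-up group sizes (5 and 20):

```r
# Sketch: what sampling does when a group is larger than the label pool.
k <- 12L

# a group of 5 rows: 5 distinct labels out of 1..12, no error
small <- sample.int(k, 5, replace = FALSE)

# a group of 20 rows: base R stops with an error because only 12 labels exist
too_big <- tryCatch(
  sample.int(k, 20, replace = FALSE),
  error = function(e) conditionMessage(e)
)
too_big  # an error message (character), not a vector of labels

# with replacement the same request succeeds, with repeated labels
with_repl <- sample.int(k, 20, replace = TRUE)
```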

Upvotes: 8

Jasper

Reputation: 555

Untested example using dplyr; if it doesn't work, it might at least point you in the right direction.

library(dplyr)
set.seed(123)
df <- data.frame(
  group_one = as.integer(runif(1000, 1, 6)),   # integers 1 through 5
  group_two = sample(LETTERS[1:6], 1000, TRUE)
) %>%
  group_by(group_one, group_two) %>%
  mutate(
    # random permutation of 1..n within each group (n = group size),
    # so every row gets a unique number rather than a label in 1..12
    sample_membership = sample(n())
  )
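
The mutate() call numbers rows 1..n within a group; one way to turn such numbers into k sample labels is to wrap them with modulo arithmetic. A small sketch, assuming k = 12 and a single hypothetical group of 20 rows:

```r
# Sketch: folding a within-group permutation onto k sample labels.
k <- 12L
n <- 20L          # size of one hypothetical group

set.seed(123)
perm <- sample(n)                   # random permutation of 1..n
labels <- (perm - 1L) %% k + 1L     # wraps 1..n onto 1..k

table(labels)    # each label is used floor(n / k) or ceiling(n / k) times
```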

Good luck!

Upvotes: 0
