Sample based on a data frame in R

Question

Assume we have two data frames data1 and data2, both with same columns, e.g.

 > head(data1)
  ID Region Age Label
1  1     CC  20     0
2  2     BB  20     1
3  3     AA  40     0
4  4     BB  60     1
5  5     BB  40     0
6  6     BB  40     1

Assume all features are factors (except ID).

Question: How do I take a representative sample from data2 based on data1? E.g. based on product frequency of data1, see below: take 6 sample with Region:AA, Age:20, Label:0, take 1 sample with Region:AA, Age:20, Label:1, etc...

> head(count(data1, c("Region", "Age", "Label")))
  Region Age Label freq
1     AA  20     0    6
2     AA  20     1    1
3     AA  40     0    3
4     AA  40     1    5
5     AA  60     0    5
6     AA  60     1    3

I was looking at the sampling package as well as dplyr package. But I can't get my head around it. Formally, I am looking for a way for stratified sampling from data2 based on distribution of features in data1.

Thank you.

Edit: First, credit goes to @Jesse Tweedle for his concise answer below using dplyr. Here an alternative partial solution using libraries sampling (function strata) and data.table is presented:

library(sampling)
library(data.table)

d1 <- data.frame(ID = 1:100, 
                 region = sample(c("AA", "BB", "CC"), 100, replace = TRUE), 
                 age = sample(c(20,40,60),100,replace = TRUE), 
                 label = sample(c(0,1), 100, replace = TRUE))
d1.table = as.data.table(d1)

d2 <- data.frame(ID = 1:1000, 
                 region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE), 
                 age = sample(c(20,40,60),1000,replace = TRUE), 
                 label = sample(c(0,1), 1000, replace = TRUE))
d2.table = as.data.table(d2)

#Sort
setkey(d1.table, region, age)
setkey(d2.table, region, age)

d1.table.freq = d1.table[,.N,keyby = list(region, age)]

d2.sample = data.table(strata(d2.table,
                              c("region", "age"),
                              d1.table.freq$N,
                              "srswor")) # random sampling without replacement

Of course this implies that all combinations of features which appear in d1 (i.e. are not 0) have to appear in d2 and the other way around. From that point of view it is not a general solution but a partial one.

Sample based on a data frame in R

Answers (1)

partly-dplyr answer

Related Questions