JJR
JJR

Reputation: 131

Sample based on a data frame in R

Assume we have two data frames data1 and data2, both with same columns, e.g.

 > head(data1)
  ID Region Age Label
1  1     CC  20     0
2  2     BB  20     1
3  3     AA  40     0
4  4     BB  60     1
5  5     BB  40     0
6  6     BB  40     1

Assume all features are factors (except ID).

Question: How do I take a representative sample from data2 based on data1? E.g. based on product frequency of data1, see below: take 6 sample with Region:AA, Age:20, Label:0, take 1 sample with Region:AA, Age:20, Label:1, etc...

> head(count(data1, c("Region", "Age", "Label")))
  Region Age Label freq
1     AA  20     0    6
2     AA  20     1    1
3     AA  40     0    3
4     AA  40     1    5
5     AA  60     0    5
6     AA  60     1    3

I was looking at the sampling package as well as dplyr package. But I can't get my head around it. Formally, I am looking for a way for stratified sampling from data2 based on distribution of features in data1.

Thank you.

Edit: First, credit goes to @Jesse Tweedle for his concise answer below using dplyr. Here an alternative partial solution using libraries sampling (function strata) and data.table is presented:

library(sampling)
library(data.table)

d1 <- data.frame(ID = 1:100, 
                 region = sample(c("AA", "BB", "CC"), 100, replace = TRUE), 
                 age = sample(c(20,40,60),100,replace = TRUE), 
                 label = sample(c(0,1), 100, replace = TRUE))
d1.table = as.data.table(d1)

d2 <- data.frame(ID = 1:1000, 
                 region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE), 
                 age = sample(c(20,40,60),1000,replace = TRUE), 
                 label = sample(c(0,1), 1000, replace = TRUE))
d2.table = as.data.table(d2)

#Sort
setkey(d1.table, region, age)
setkey(d2.table, region, age)

d1.table.freq = d1.table[,.N,keyby = list(region, age)]

d2.sample = data.table(strata(d2.table,
                              c("region", "age"),
                              d1.table.freq$N,
                              "srswor")) # random sampling without replacement

Of course this implies that all combinations of features which appear in d1 (i.e. are not 0) have to appear in d2 and the other way around. From that point of view it is not a general solution but a partial one.

Upvotes: 0

Views: 482

Answers (1)

twedl
twedl

Reputation: 1648

partly-dplyr answer

Here's some fake data, with a counts dataset:

data1 <- tibble(id = 1:30,
                region = sample(letters[1:3], 30, replace = TRUE),
                label = sample(0:1, 30, replace = TRUE))
counts <- data1 %>% group_by(region, label) %>% count()

data2 <- tibble(id = 1:300,
                region = sample(letters[1:3], 300, replace = TRUE),
                label = sample(0:1, 300, replace = TRUE))

sample_n usually helps here, but won't take separate size arguments for each group. So we join the counts dataset, split on the region, label variables, map sample_n to each list, using size = n (where n is from count), and then use bind_rows to put the list of dataframes back together:

data2 %>%
  left_join(counts) %>%
  split(list(data2$region, data2$label)) %>%
  map(~ sample_n(.x, size = unique(.x$n))) %>%
  bind_rows()

If your datasets are much different you may have to use replace = TRUE in the sample_n function.

Upvotes: 1

Related Questions