Reputation: 131
Assume we have two data frames data1 and data2, both with same columns, e.g.
> head(data1)
ID Region Age Label
1 1 CC 20 0
2 2 BB 20 1
3 3 AA 40 0
4 4 BB 60 1
5 5 BB 40 0
6 6 BB 40 1
Assume all features are factors (except ID).
Question: How do I take a representative sample from data2 based on data1? E.g. based on product frequency of data1, see below: take 6 sample with Region:AA, Age:20, Label:0, take 1 sample with Region:AA, Age:20, Label:1, etc...
> head(count(data1, c("Region", "Age", "Label")))
Region Age Label freq
1 AA 20 0 6
2 AA 20 1 1
3 AA 40 0 3
4 AA 40 1 5
5 AA 60 0 5
6 AA 60 1 3
I was looking at the sampling
package as well as dplyr
package. But I can't get my head around it. Formally, I am looking for a way for stratified sampling from data2 based on distribution of features in data1.
Thank you.
Edit: First, credit goes to @Jesse Tweedle for his concise answer below using dplyr
. Here an alternative partial solution using libraries sampling
(function strata
) and data.table
is presented:
library(sampling)
library(data.table)
d1 <- data.frame(ID = 1:100,
region = sample(c("AA", "BB", "CC"), 100, replace = TRUE),
age = sample(c(20,40,60),100,replace = TRUE),
label = sample(c(0,1), 100, replace = TRUE))
d1.table = as.data.table(d1)
d2 <- data.frame(ID = 1:1000,
region = sample(c("AA", "BB", "CC"), 1000, replace = TRUE),
age = sample(c(20,40,60),1000,replace = TRUE),
label = sample(c(0,1), 1000, replace = TRUE))
d2.table = as.data.table(d2)
#Sort
setkey(d1.table, region, age)
setkey(d2.table, region, age)
d1.table.freq = d1.table[,.N,keyby = list(region, age)]
d2.sample = data.table(strata(d2.table,
c("region", "age"),
d1.table.freq$N,
"srswor")) # random sampling without replacement
Of course this implies that all combinations of features which appear in d1 (i.e. are not 0) have to appear in d2 and the other way around. From that point of view it is not a general solution but a partial one.
Upvotes: 0
Views: 482
Reputation: 1648
Here's some fake data, with a counts
dataset:
data1 <- tibble(id = 1:30,
region = sample(letters[1:3], 30, replace = TRUE),
label = sample(0:1, 30, replace = TRUE))
counts <- data1 %>% group_by(region, label) %>% count()
data2 <- tibble(id = 1:300,
region = sample(letters[1:3], 300, replace = TRUE),
label = sample(0:1, 300, replace = TRUE))
sample_n
usually helps here, but won't take separate size
arguments for each group. So we join the counts
dataset, split
on the region, label
variables, map
sample_n
to each list, using size = n
(where n
is from count), and then use bind_rows
to put the list of dataframes back together:
data2 %>%
left_join(counts) %>%
split(list(data2$region, data2$label)) %>%
map(~ sample_n(.x, size = unique(.x$n))) %>%
bind_rows()
If your datasets are much different you may have to use replace = TRUE
in the sample_n
function.
Upvotes: 1