Reputation: 77
I have two data frames in R
. One contains a row for each individual person and the area they live in. E.g.
df1 = data.frame(Person_ID = seq(1,10,1), Area = c("A","A","A","B","B","C","D","A","D","C"))
The other data frame contains demographic information for each Area
.
E.g. for gender df2 = data.frame(Area = c("A","A","B","B","C","C","D","D"), gender = c("M","F","M","F","M","F","M","F"), probability = c(0.4,0.6,0.55,0.45,0.6,0.4,0.5,0.5))
In df1
I want to create a gender column where for each row of df1
I sample a gender from the appropriate subset of df2
.
For example, for row 1 of df1
I would sample a gender from df2 %>% filter(Area == "A")
The question is how do I do this for all rows without a for loop as in practice df1
could have up to 5 million rows?
Upvotes: 1
Views: 76
Reputation: 389175
Try using the following :
library(dplyr)
library(tidyr)
out <- df1 %>%
nest(data = -Area) %>%
left_join(df2, by = 'Area') %>%
group_by(Area) %>%
summarise(data = map(data, ~.x %>%
mutate(gender = sample(gender, n(),
prob = probability, replace = TRUE)))) %>%
distinct(Area, .keep_all = TRUE) %>%
unnest(data)
We first nest
df1
and join it with df2
by Area
. For each Area
we sample
gender
value based on probability
in df2
and unnest
to get long dataframe.
There are not enough samples in df1
to verify the result but if we increase number of rows in df1
the proportion should be similar to probability
in df2
.
Upvotes: 1