Reputation: 93
I have two datasets. I want to assign for every row of second dataset a "Group" from first dataset based on "Probabilities" from first dataset. Logic what I want to do is as follows:
my datasets:
Period | Group | socio_demo | Probabilities |
---|---|---|---|
01.2021-03.2021 | 1 | 20yo_West_old_3child | 60 |
01.2021-03.2021 | 2 | 20yo_West_old_3child | 20 |
01.2021-03.2021 | 3 | 20yo_West_old_3child | 10 |
01.2021-03.2021 | 4 | 20yo_West_old_3child | 10 |
01.2021-03.2021 | 1 | 20yo_West_old_2child | 50 |
01.2021-03.2021 | 2 | 20yo_West_old_2child | 20 |
01.2021-03.2021 | 3 | 20yo_West_old_2child | 25 |
01.2021-03.2021 | 4 | 20yo_West_old_2child | 5 |
Period | ID | socio_demo | Weight |
---|---|---|---|
01.2021-03.2021 | 1 | 20yo_West_old_3child | 0.1 |
01.2021-03.2021 | 2 | 20yo_West_old_3child | 0.11 |
01.2021-03.2021 | 3 | 20yo_West_old_3child | 0.69 |
01.2021-03.2021 | 4 | 20yo_West_old_3child | 0.14 |
01.2021-03.2021 | 5 | 20yo_West_old_3child | 0.98 |
01.2021-03.2021 | 6 | 20yo_West_old_3child | 0.89 |
01.2021-03.2021 | 7 | 20yo_West_old_3child | 0.45 |
01.2021-03.2021 | 8 | 20yo_West_old_3child | 0.1 |
01.2021-03.2021 | 9 | 20yo_West_old_3child | 0.4 |
01.2021-03.2021 | 10 | 20yo_West_old_3child | 0.3 |
I want to obtain:
Period | ID | socio_demo | Weight | Group |
---|---|---|---|---|
01.2021-03.2021 | 1 | 20yo_West_old_3child | 0.1 | 1 |
01.2021-03.2021 | 2 | 20yo_West_old_3child | 0.11 | 1 |
01.2021-03.2021 | 3 | 20yo_West_old_3child | 0.69 | 2 |
01.2021-03.2021 | 4 | 20yo_West_old_3child | 0.14 | 1 |
01.2021-03.2021 | 5 | 20yo_West_old_3child | 0.98 | 4 |
01.2021-03.2021 | 6 | 20yo_West_old_3child | 0.89 | 1 |
01.2021-03.2021 | 7 | 20yo_West_old_3child | 0.45 | 3 |
01.2021-03.2021 | 8 | 20yo_West_old_3child | 0.1 | 1 |
01.2021-03.2021 | 9 | 20yo_West_old_3child | 0.4 | 2 |
01.2021-03.2021 | 10 | 20yo_West_old_3child | 0.3 | 1 |
I tried:
df2 %>%
left_join(df1) %>%
tidyr::uncount(Probabilities) %>%
group_by(Period, ID) %>%
slice_sample(n = 1) %>%
ungroup()
But unfortunately this is giving me error: *Long vectors are not yet supported. Requested output must be less than 2147483647. *
Thank you for helping
Upvotes: 0
Views: 36
Reputation: 401
A sample of R code to carry out the task you described is provided here:
library(dplyr)
df1 <- data.frame(Period = c("01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021"),
Group = c(1,2,3,4,1,2,3,4),
socio_demo = c("20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_2child", "20yo_West_old_2child", "20yo_West_old_2child", "20yo_West_old_2child"),
Probabilities = c(60,20,10,10,50,20,25,5))
df2 <- data.frame(Period = c("01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021", "01.2021-03.2021"),
ID = c(1,2,3,4,5,6,7,8,9,10),
socio_demo = c("20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child", "20yo_West_old_3child"),
Weight = c(0.1,0.11,0.69,0.14,0.98,0.89,0.45,0.1,0.4,0.3))
result_df <- df2 %>%
group_by(Period, socio_demo) %>%
mutate(Group = sample(df1$Group, n(), replace = TRUE, prob = df1$Probabilities/sum(df1$Probabilities)))
Before defining the two data frames df1 and df2, we loaded the dplyr library in the preceding code. The operations you described were then carried out with the help of the dplyr library's group_by and mutate functions: grouping the df2 data frame by socio_demo and Period, and then randomly assigning a Group to each row of df2 based on the probabilities of each Group in df1.
# Groups: Period, socio_demo [1]
Period ID socio_demo Weight Group
<chr> <dbl> <chr> <dbl> <dbl>
1 01.2021-03.2021 1 20yo_West_old_3child 0.1 3
2 01.2021-03.2021 2 20yo_West_old_3child 0.11 1
3 01.2021-03.2021 3 20yo_West_old_3child 0.69 3
4 01.2021-03.2021 4 20yo_West_old_3child 0.14 1
5 01.2021-03.2021 5 20yo_West_old_3child 0.98 1
6 01.2021-03.2021 6 20yo_West_old_3child 0.89 1
7 01.2021-03.2021 7 20yo_West_old_3child 0.45 1
8 01.2021-03.2021 8 20yo_West_old_3child 0.1 1
9 01.2021-03.2021 9 20yo_West_old_3child 0.4 1
10 01.2021-03.2021 10 20yo_West_old_3child 0.3 1
Upvotes: 1