Reputation: 3384
data1=data.frame("School"=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
"Group"=c(1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2),
"Class"=c('A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C'),
"Size"=c(459,441,410,201,327,156,129,427,249,331,477,458,288,472,275,449,424,469,386,387,103,320,284,277,481,167,348,247,115,193))
data2=data.frame("ID"=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
"Group"=c(2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,2,1,2,1,1,2,2,2,2,2,1,1,1,1,1),
"Class"=c('A','B','C','B','C','C','A','A','A','B','B','A','B','A','C','C','B','A','B','A','B','C','B','A','C','B','B','C','C','B'),
"Funds"=c(5,8,9,4,6,3,5,7,6,7,6,7,6,7,6,7,4,9,5,7,5,5,5,7,9,7,6,9,4,7),
"Ratio"=c(2,2,3,1,1,2,3,3,1,3,3,3,3,1,1,3,2,1,1,1,3,1,3,3,1,2,1,3,1,3),
"WEIGHT"=c(162,65,104,118,105,49,107,93,25,24,103,115,64,186,57,123,86,181,70,111,154,135,37,73,127,148,188,169,73,52))
I have data1 and data2 and hope to provide a simple example.
In data1 you see the 'School' number and the 'Size' of the 'Group' X 'Class' combinations. In data2 you have information on 'Funds' and 'Ratio' for each of the 'Group' X 'Class' combinations. In data2 you also have 'WEIGHT' which equals to the sampling frame, which is to say that different 'ID's have different "weight" or importance.
I wish to create data3 which would look like this:
Overall I wish for data3 to have sum(data1$Size)
rows. I wish to copy 'Size' rows for each 'School' X 'Group' X 'Class' combination from data1. Then I wish to sample with replacement by 'Group' X 'Class' 'Funds' and 'Ratio' from data2, using WEIGHTS to inform the probability of picking each row, to fill in data3.
Upvotes: 1
Views: 270
Reputation: 886938
Here is an option with data.table
where we expand the data1 based on the 'Size' column, then split both datasets by 'Group', 'Class' and use Map to create the columns in first dataset based on a probablity created with WEIGHT/sum(WEIGHT)
to be used in the sample
for sampling the 'Funds', 'Ratio' column values, assign (:=
) it to create new columns in the list
and finally rbind
the list elements with
rbindlist`
library(data.table)
dt1 <- setDT(data1)[rep(seq_len(nrow(data1)), Size)]
lst1 <- split(dt1, dt1[, .(Group, Class)])
lst2 <- split(data2, data2[c('Group', 'Class')], drop = TRUE)
out <- rbindlist(Map(function(x, y) {
prb = y$WEIGHT/sum(y$WEIGHT)
x[, c('Funds', 'Ratio') :=
.(sample(y$Funds, size = .N, replace = TRUE, prob=prb),
sample(y$Ratio, size = .N, replace = TRUE, prob=prb) )]},
lst1, lst2[names(lst1)]))
out[, Size := NULL]
sum(data1$Size)
#[1] 9750
nrow(out)
#[1] 9750
Upvotes: 1