bvowe
bvowe

Reputation: 3384

R Weighted Sampling Procedures

data1=data.frame("School"=c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3,4,4,4,4,4,4,5,5,5,5,5,5),
"Group"=c(1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2,1,1,1,2,2,2),
"Class"=c('A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C','A','B','C'),
"Size"=c(459,441,410,201,327,156,129,427,249,331,477,458,288,472,275,449,424,469,386,387,103,320,284,277,481,167,348,247,115,193))

data2=data.frame("ID"=c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30),
"Group"=c(2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,2,1,2,1,1,2,2,2,2,2,1,1,1,1,1),
"Class"=c('A','B','C','B','C','C','A','A','A','B','B','A','B','A','C','C','B','A','B','A','B','C','B','A','C','B','B','C','C','B'),
"Funds"=c(5,8,9,4,6,3,5,7,6,7,6,7,6,7,6,7,4,9,5,7,5,5,5,7,9,7,6,9,4,7),
"Ratio"=c(2,2,3,1,1,2,3,3,1,3,3,3,3,1,1,3,2,1,1,1,3,1,3,3,1,2,1,3,1,3),
"WEIGHT"=c(162,65,104,118,105,49,107,93,25,24,103,115,64,186,57,123,86,181,70,111,154,135,37,73,127,148,188,169,73,52))

I have data1 and data2 and hope to provide a simple example.

In data1 you see the 'School' number and the 'Size' of the 'Group' X 'Class' combinations. In data2 you have information on 'Funds' and 'Ratio' for each of the 'Group' X 'Class' combinations. In data2 you also have 'WEIGHT' which equals to the sampling frame, which is to say that different 'ID's have different "weight" or importance.

I wish to create data3 which would look like this:

enter image description here

Overall I wish for data3 to have sum(data1$Size) rows. I wish to copy 'Size' rows for each 'School' X 'Group' X 'Class' combination from data1. Then I wish to sample with replacement by 'Group' X 'Class' 'Funds' and 'Ratio' from data2, using WEIGHTS to inform the probability of picking each row, to fill in data3.

Upvotes: 1

Views: 270

Answers (1)

akrun
akrun

Reputation: 886938

Here is an option with data.table where we expand the data1 based on the 'Size' column, then split both datasets by 'Group', 'Class' and use Map to create the columns in first dataset based on a probablity created with WEIGHT/sum(WEIGHT) to be used in the sample for sampling the 'Funds', 'Ratio' column values, assign (:=) it to create new columns in the list and finally rbind the list elements withrbindlist`

library(data.table)
dt1 <- setDT(data1)[rep(seq_len(nrow(data1)), Size)]
lst1 <- split(dt1, dt1[, .(Group, Class)])
lst2 <- split(data2, data2[c('Group', 'Class')], drop = TRUE)

out <- rbindlist(Map(function(x, y) {
            prb = y$WEIGHT/sum(y$WEIGHT)
            x[, c('Funds', 'Ratio') := 
           .(sample(y$Funds,  size = .N, replace = TRUE, prob=prb), 
            sample(y$Ratio,  size = .N, replace = TRUE, prob=prb) )]}, 
            lst1, lst2[names(lst1)]))

out[, Size := NULL]
sum(data1$Size)
#[1] 9750
nrow(out)
#[1] 9750

Upvotes: 1

Related Questions