Reputation: 323
I have the following toy dataset
set.seed(100)
df <- data.frame(ID = rep(1:5, each = 3),
value = sample(LETTERS, 15, replace = TRUE),
weight = rep(c(0.1, 0.1, 0.5, 0.2, 0.1), each = 3))
df
ID value weight
1 1 I 0.1
2 1 G 0.1
3 1 O 0.1
4 2 B 0.1
5 2 M 0.1
6 2 M 0.1
7 3 V 0.5
8 3 J 0.5
9 3 O 0.5
10 4 E 0.2
11 4 Q 0.2
12 4 W 0.2
13 5 H 0.1
14 5 K 0.1
15 5 T 0.1
where each ID is an individual respondent, answering 3 questions (in the actual dataset, the number of questions answered is variable, so I can't rely on a certain number of rows per ID).
I want to create a new (larger) dataset which samples from the individual IDs based on the weights in weight
.
probs <- data.frame(ID = unique(df$ID))
probs$prob <- NA
for(i in 1:nrow(probs)){
probs$prob[i] <- df[df$ID %in% probs$ID[i],]$weight[1]
}
probs$prob <- probs$prob / sum(probs$prob)
sampledIDs <- sample(probs$ID, size = 10000, replace = TRUE, prob = probs$prob)
head(sampledIDs,10)
[1] 4 3 3 3 4 4 2 4 2 3
Moving from the probabilistic sampling of IDs to the actual creation of the new data.frame
is stumping me. I've tried
dfW <- df[df$ID %in% sampledIDs,]
but that obviously doesn't take into account the fact that IDs repeat. I've also tried a loop:
dfW <- df[df$ID == sampledIDs[1],]
for(i in 2:length(sampledIDs)){
dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
}
but that's painfully slow with a large dataset.
Any help would be very appreciated.
(Also, if there are simpler ways of doing the probabilistic selection of IDs, that would be great to hear too!)
Upvotes: 0
Views: 250
Reputation: 190
If you don't know the final size you can resize it whenever needed, but a new if condition should be added in the for
loop. First define the function to resize the dataframe as follow:
double_rowsize <- function(df) {
mdf <- as.data.frame(matrix(, nrow = nrow(df), ncol = ncol(df)))
colnames(mdf) <- colnames(df)
df <- rbind(df, mdf)
return(df)
}
Then start the dfW
with an initial size like 12 (3 times 4):
dfW <- as.data.frame(matrix(nrow = 12, ncol = 3))
colnames(dfW) <- colnames(df)
And finally add an if
condition in the for
loop to resize the dataframe whenever needed:
for(i in 1:length(sampledIDs)){
if (3*i > nrow(dfW))
dfW <- double_rowsize(dfW)
dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
You can change the details of function double_rowsize to change the dataframe size with a different number rather than 2 if anything else works better. 2 is common because it works best in array resizing.
Good luck!
Upvotes: 0
Reputation: 190
The code speed is low because you resize the data frame in every cycle of the for
loop. Here is my suggestion. Create a dataframe with the final size that the data framedfW
will have before the for
loop. Then assign the values from data frame df
to dfW
in the for
loop. You may change the last part of your code with this:
dfW <- as.data.frame(matrix(nrow = 3 * length(sampledIDs), ncol = 3))
colnames(dfW) <- colnames(df) # make the column names the same
for(i in 1:length(sampledIDs)){ # notice the start index is changed from 2 to 1
#dfW <- rbind(dfW, df[df$ID == sampledIDs[i],])
dfW[(3*i-2):(3*i),] <- df[df$ID == sampledIDs[i],]
}
Your code should run much faster with this change. Let me know how it goes!
Upvotes: 1