Reputation: 568
I have the following data structure:
set.seed(100)
x <- data.frame("smp_1"=runif(20)*100,"smp_2"=runif(20)*99)
x["weight_1"] = x$smp_1/sum(x$smp_1)
x["weight_2"] = x$smp_2/sum(x$smp_2)
> head(x)
smp_1 smp_2 weight_1 weight_2
1 66.61718 68.976341 0.05721288 0.061115678
2 24.65804 77.966842 0.02117709 0.069081607
3 66.10397 1.611913 0.05677212 0.001428216
4 93.95866 1.793973 0.08069459 0.001589529
5 19.96638 31.008240 0.01714774 0.027474488
6 66.35187 97.033923 0.05698502 0.085975770
Now I want to create a new data frame by sampling from each smp column, using the matching weight column as the sampling probabilities, and storing each sample as a new column. I can do this with a for loop:
tempdf <- data.frame(matrix(0,ncol=0,nrow=1000))
for (k in 1:2) {
  tempdf[, paste0("sim_", k)] <- sample(x[, paste0("smp_", k)], size = 1000, replace = TRUE, prob = x[, paste0("weight_", k)])
}
My question is: how can I do this without a for loop, in a more efficient way? I will be sampling 100k values from multiple columns, so I need something quite quick.
Upvotes: 2
Views: 57
Reputation: 887118
Here is one option with tidyverse. Using map2, we select the 'smp' and 'weight' columns and use each corresponding 'weight' column as the probabilities for sampling its 'smp' column.
library(tidyverse)
map2_df(x %>% dplyr::select(matches("^smp")),
        x %>% dplyr::select(matches("^weight")),
        ~ sample(.x, size = 1000, replace = TRUE, prob = .y))
Upvotes: 0
Reputation: 388982
In base R, we can separate the "smp" and "weight" columns and use mapply (which, by the way, is still a loop internally) to sample values.
sample_col <- grep("^smp", names(x))
weigth_col <- grep("^weight", names(x))
mapply(function(p, q) sample(p, size = 1000, replace = TRUE, prob = q),
x[,sample_col], x[,weigth_col])
# smp_1 smp_2
# [1,] 62.499648 74.148250
# [2,] 88.216552 94.461613
# [3,] 55.232243 70.369581
# [4,] 28.035384 74.148250
# [5,] 39.848790 76.259859
# [6,] 39.848790 97.966850
# [7,] 88.216552 91.922002
# [8,] 20.461216 97.966850
# [9,] 66.902171 53.045304
#[10,] 54.655860 76.259859
#...
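The mapply call above returns a matrix whose columns keep the smp_k names, while the loop in the question builds a data frame with sim_k columns. A minimal sketch of closing that gap, assuming the question's example data; n_sim is a hypothetical name introduced here, and the columns are selected with x[cols] rather than x[, cols] so a single matching column would not be dropped to a vector:

```r
# Rebuild the example data from the question.
set.seed(100)
x <- data.frame(smp_1 = runif(20) * 100, smp_2 = runif(20) * 99)
x$weight_1 <- x$smp_1 / sum(x$smp_1)
x$weight_2 <- x$smp_2 / sum(x$smp_2)

sample_col <- grep("^smp", names(x))
weight_col <- grep("^weight", names(x))

# n_sim is a hypothetical name for the number of draws per column.
n_sim <- 1000

# mapply pairs each smp column with its weight column and
# simplifies the equal-length results into a numeric matrix.
sims <- mapply(function(p, q) sample(p, size = n_sim, replace = TRUE, prob = q),
               x[sample_col], x[weight_col])

# Convert to a data frame and rename smp_k to sim_k,
# matching the column names produced by the loop in the question.
tempdf <- as.data.frame(sims)
names(tempdf) <- sub("^smp", "sim", names(tempdf))
```

This relies on grep returning the smp and weight columns in the same order, which holds for the example but is worth checking on real data.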
Upvotes: 0
Reputation: 27732
Here is a data.table approach. In the result ans, the value of variable (1 or 2) corresponds to your k.
library(data.table)
#melt to long format
DT <- melt( setDT(x) ,
id.vars = NULL,
measure.vars = patterns( smp = "^smp",
weight = "^weight"))
#pull samples
ans <- DT[ , .( sim = sample( smp,
size = 1000,
replace = TRUE,
prob = weight)),
by = .(variable) ]
# variable sim
# 1: 1 69.02905
# 2: 1 30.77661
# 3: 1 37.03205
# 4: 1 35.75249
# 5: 1 48.37707
# 6: 1 55.23224
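The result above is in long format, with one row per draw. If one sim_k column per sample is wanted, as in the question, it could be reshaped along these lines. This is only a sketch under the question's example data: draw is a hypothetical helper column added here to number the draws within each group, and as.data.table is used instead of setDT so that x is not modified by reference.

```r
library(data.table)

# Rebuild the example data from the question.
set.seed(100)
x <- data.frame(smp_1 = runif(20) * 100, smp_2 = runif(20) * 99)
x$weight_1 <- x$smp_1 / sum(x$smp_1)
x$weight_2 <- x$smp_2 / sum(x$smp_2)

# Melt to long format and sample per group, as in the answer above.
DT <- melt(as.data.table(x),
           measure.vars = patterns(smp = "^smp", weight = "^weight"))
ans <- DT[, .(sim = sample(smp, size = 1000, replace = TRUE, prob = weight)),
          by = .(variable)]

# Number the draws within each group (hypothetical helper column)
# so dcast has a row identifier to spread on.
ans[, draw := seq_len(.N), by = variable]
wide <- dcast(ans, draw ~ variable, value.var = "sim")

# Rename the value columns "1", "2", ... to sim_1, sim_2, ...
k <- as.character(unique(ans$variable))
setnames(wide, old = k, new = paste0("sim_", k))
wide[, draw := NULL]
```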
Upvotes: 2