Ollie
Ollie

Reputation: 664

Drawing from two distributions with a probability in R

I am trying to draw from two different distributions with a probability 100000 times. Unfortunately I can't see what is wrong with my for loop, however, it only adds 1 value to simulated_data instead of the desired 100,000 values.

Question 1: How can I fix this?

Question 2: Is there a far more efficient method where I don't have to loop through 100,000 items in a list?

#creating a vector of probabilities
probabilities <- rep(0.99,100000)
#creating a vector of booleans
logicals <- runif(length(probabilities)) < probabilities

#empty list for my simulated data
simulated_data <- c()

#drawing from two different distributions depending on the value in logicals
for(i in logicals){

  if (isTRUE(i)) {
    simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 1)
  }else{
     simulated_data[i] <- rnorm(n = 1, mean = 0, sd = 10)
   }
}

Upvotes: 1

Views: 1116

Answers (3)

IRTFM
IRTFM

Reputation: 263411

Create a vector with the desired fraction of values from each distribution and then create a random permutation of the values:

N = 10000
frac =0.99
rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )

> table( abs(rand_mix) >1.96)

FALSE  TRUE 
 9364   636 
> (100000-636)/100000
[1] 0.99364

> table( rnorm(10000) >6)

FALSE 
10000 

The fraction is fixed. If you wante a possibly random fraction (but close to 0.99 statistically) then try this:

> table( sample( c( rnorm(10e6), rnorm(10e4, sd=10) ), 10e4) > 1.96 )

FALSE  TRUE 
97151  2849 

Compare with:

> N = 100000
> frac =0.99
> rand_mix = sample( c( rnorm( frac*N, 0, sd=1) , rnorm( (1-frac)*N, 0, sd=10) ) )
> table( rand_mix > 1.96 )

FALSE  TRUE 
97117  2883 

Upvotes: 0

R. Schifini
R. Schifini

Reputation: 9313

It seems that you want to create a final sample where each element is taken randomly from either sample1 or sample2, with probabilities 0.99 and 0.01.

The correct approach would be to generate both samples, each containing the same number of elements and then select randomly from either one.

The correct approach would be:

# Generate both samples
n = 100000
sample1 = rnorm(n,0,1)
sample2 = rnorm(n,0,10)

# Create the logical vector that will decide whether to take from sample 1 or 2
s1_s2 = runif(n) < 0.99

# Create the final sample
sample = ifelse(s1_s2 , sample1, sample2)

In this case, it is not guaranteed that there are exactly 0.99*n samples from sample1 and 0.01*n from sample2. In fact:

> sum(sample == sample1)
[1] 98953

This is close to 0.99*n, as expected, but not exactly.

Upvotes: 2

Ollie
Ollie

Reputation: 664

Here is a nice solution for anyone here:

n <- 100000
prob1 <- 0.99
prob2 <- 1-prob1 

dist1 <- rnorm(prob1*n, 0, 1)
dist2 <- rnorm(prob2*n, 0, 10)

actual_sample <- c(dist1, dist2)

Upvotes: 0

Related Questions