Do you consider the different probabilities of events when doing a random sample? (in R)

Question

I'm not very knowledgeable in R, so I would really appreciate any help. Thanks in advance!

A) My main issue is whether I should be considering the different probabilities of events in a random sample and how exactly to do that.

For instance, out of the oranges I have, 71% are super juicy, 22% are medium juicy, and 7% are mildly juicy. There are three distributions I am drawing from, one per juice level: super juicy, medium juicy, and mildly juicy. Super juicy has a mean of 400 and an sd of 100. Medium juicy has a mean of 300 and an sd of 75. Mildly juicy has a mean of 200 and an sd of 60.

I want to create a juiciness rating for the juice I'll be making out of the oranges. The juiciness rating is defined as the mean rating of the oranges used in the juice.

Since I want to do a random sample of 15 oranges, my code looks like this:

set.seed(4000)

samples=rnorm(15, mean=c(400,300,200), sd=c(100,75,60))

This should return the 15 randomly sampled oranges and their respective juiciness ratings. Then, to get the rating of the entire juice, I do:

rating.juice=mean(samples)

rating.juice

Is this correct? I'm not sure whether I should account for the fact that 71% of the oranges are super juicy, 22% are medium juicy, and 7% are mildly juicy.

Vons · Accepted Answer

A) This is not correct. The call generates 5 super juicy oranges, 5 medium juicy oranges, and 5 mild juicy oranges, regardless of the 71%, 22%, and 7% proportions, because the mean vector and the standard deviation vector you give are recycled every 3 draws. To make this clearer, see the following code and output, where the means are changed to 400, 0, and -200 so the pattern is easy to spot. The first and fourth elements have a mean of 400, the second and fifth have a mean of 0, and the third and sixth have a mean of -200.

> samples=rnorm(6, mean=c(400,0,-200), sd=c(100,75,60))
> samples
[1]  360.82620   49.81907 -254.86976  347.60612  -12.95888 -220.26652

I would generate a uniform random number on the interval from 0 to 1. If it is below .71, draw from the super juicy distribution; if it is between .71 and .93, draw from the medium juicy distribution; otherwise draw from the mild juicy distribution. This picks the three groups with probabilities .71, .22, and .07.

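set.seed(4000)
oranges=numeric(15)  # one juiciness value per orange
for (i in 1:15) {
  prob=runif(1)  # decides which group this orange comes from
  if (prob < .71) {
    orange=rnorm(1, 400, 100)  # super juicy, 71%
  } else if (prob < .93) {
    orange=rnorm(1, 300, 75)  # medium juicy, 22%
  } else {
    orange=rnorm(1, 200, 60)  # mild juicy, 7%
  }
  oranges[i]=orange
}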

> print(mean(oranges))
[1] 330.9605
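
The same kind of draw can also be written without a loop. This is only a sketch, and the names n, means, sds, and group are my own: sample() with its prob argument picks a group for each orange, and rnorm() then gets one mean and one sd per orange instead of a recycled vector of three. It uses the random number stream in a different order than the loop, so the same seed should not be expected to reproduce 330.9605.

```r
set.seed(4000)
n = 15
means = c(400, 300, 200)  # super, medium, mild
sds = c(100, 75, 60)
# pick a group (1, 2 or 3) for each orange with the stated proportions
group = sample(1:3, size=n, replace=TRUE, prob=c(.71, .22, .07))
# one draw per orange, each using its own group's mean and sd
oranges = rnorm(n, mean=means[group], sd=sds[group])
mean(oranges)
```

Because means[group] and sds[group] both have length 15, nothing is recycled, which is the problem the original rnorm call ran into.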

B) This is correct. You would not consider the 71% super juicy, etc. here because the number of super juicy, medium juicy, and mild juicy oranges is fixed in advance. The place where you would consider the probabilities is part A.

C) To find the probability, you could use simulation. It is hard to find a closed-form solution for this.

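# mean juiciness of one random sample of 15 oranges
juicyA=function() {
  oranges=numeric(15)
  for (i in 1:15) {
    prob=runif(1)  # decides which group this orange comes from
    if (prob < .71) {
      orange=rnorm(1, 400, 100)  # super juicy, 71%
    } else if (prob < .93) {
      orange=rnorm(1, 300, 75)  # medium juicy, 22%
    } else {
      orange=rnorm(1, 200, 60)  # mild juicy, 7%
    }
    oranges[i]=orange
  }
  return(mean(oranges))
}
# count how many of 10000 simulated juices rate strictly between 250 and 300
in_range=0
for (i in 1:10000) {
  juiciness=juicyA()
  if (juiciness > 250 && juiciness < 300) {
    in_range=in_range+1
  }
}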

> print(in_range/10000)
[1] 0.0148
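
A more compact way to run this kind of simulation, again only as a sketch with names of my own (juicy_mean, ratings, p_hat, se): replicate() repeats the 15-orange draw, and the mean of a logical vector gives the proportion in range. The last line estimates the Monte Carlo standard error, which gives a feel for how much an estimate such as 0.0148 can move from run to run.

```r
set.seed(4000)
# mean juiciness of one batch of n oranges drawn with the stated proportions
juicy_mean = function(n=15) {
  group = sample(1:3, size=n, replace=TRUE, prob=c(.71, .22, .07))
  mean(rnorm(n, mean=c(400, 300, 200)[group], sd=c(100, 75, 60)[group]))
}
ratings = replicate(10000, juicy_mean())
# proportion of simulated batches rated strictly between 250 and 300
p_hat = mean(ratings > 250 & ratings < 300)
# Monte Carlo standard error of that proportion
se = sqrt(p_hat * (1 - p_hat) / length(ratings))
p_hat
se
```

Increasing the 10000 shrinks the standard error roughly with the square root of the number of repetitions.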

Something similar can be done to find the probability that the juiciness will be between 250 and 300 for the oranges drawn from part B. Note that we are using the definition of probability as the long-run proportion of successes.
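
As a rough sanity check on the part A simulation, the mean and variance of a single orange can be worked out from the mixture, and the mean of 15 oranges can then be treated as approximately normal. This is an approximation I am adding, not an exact answer: 15 draws from a mixture are not exactly normal, so it should only land in the same neighbourhood as the simulated proportion. The variable names are my own.

```r
p = c(.71, .22, .07)      # group proportions
mu = c(400, 300, 200)     # group means
sigma = c(100, 75, 60)    # group standard deviations

# mean of one orange: weighted average of the group means (about 364)
mix_mean = sum(p * mu)
# variance of one orange: E[X^2] - (E[X])^2 for the mixture
mix_var = sum(p * (sigma^2 + mu^2)) - mix_mean^2
# standard deviation of the mean of 15 independent oranges
se_mean = sqrt(mix_var / 15)
# normal approximation to P(250 < rating < 300)
approx = pnorm(300, mix_mean, se_mean) - pnorm(250, mix_mean, se_mean)
approx
```

If this value and the simulated proportion were far apart, that would suggest a bug in one of them.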
