Filippo

Reputation: 33

How can I create a normally distributed set of data in R?

I'm a newbie in statistics and I'm studying R. I decided to do this exercise to practice some analysis with an original dataset.

This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score. This test score ranges from 0 to 70, and the mean score is 48 (and it's improbable that someone scores 0).

First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out with plot(x) that the values were not normally distributed. So I searched for another R command and found this, but I couldn't set the min/max:

ex1 <- round(rnorm(100, mean = 48, sd = 5))

I really can't understand what I have to do!

I would like to write a function that gives me a set of normally distributed data, in a range of 0-70, with a mean of 48 and a not-too-big standard deviation, in order to do some t-tests later... Any help?

Thanks a lot in advance guys

Upvotes: 0

Views: 1236

Answers (1)

Gregor Thomas

Reputation: 146110

The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:

ex1 <- round(rnorm(100, mean = 48, sd = 5))
ex1 <- pmin(ex1, 70)  # cap any values above 70 at 70
ex1 <- pmax(ex1, 0)   # raise any values below 0 to 0

You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:

pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22

This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
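For example (swapping in a hypothetical SD of 20, just for illustration), the lower tail is no longer negligible:

pnorm(0, mean = 48, sd = 20)
# [1] 0.008197536

With that much spread, roughly 0.8% of draws would land below 0, so you would have to handle the bounds one way or another.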

This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method might be to draw more samples than you need, and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
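A minimal sketch of that resampling idea (the 2x oversampling factor is just an arbitrary safety margin; with a mean of 48 and an SD of 5, almost nothing falls outside 0-70 anyway):

n <- 100
draws <- round(rnorm(2 * n, mean = 48, sd = 5))  # draw more than needed
keep <- draws[draws >= 0 & draws <= 70]          # drop anything out of bounds
ex2 <- head(keep, n)                             # keep the first n survivors
length(ex2)  # 100; with these parameters rejections almost never happen

If you go the package route instead, truncnorm is one package that implements a truncated normal; its rtruncnorm() function lets you specify the bounds directly.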

(Because the normal distribution is symmetric, and 100 is farther from your mean than 0, the probability of observations > 100 is even smaller.)

A better approach

A better approach might be to choose a different distribution that does have a minimum and maximum and can be configured to a shape that you like. The beta distribution, for example, is bounded between 0 and 1. If you multiply it by 70, it will be between 0 and 70, and then you can round to the nearest integer. A beta distribution with parameters alpha (shape1) = 4 and beta (shape2) = 2 would give you a distribution where 0 is relatively unlikely, with a mean of 2/3 (or about 47, after you multiply by 70 and round).

# Simulate 1000 scores from a Beta(4, 2) scaled to 0-70, round, and plot
{rbeta(1000, shape1 = 4, shape2 = 2) * 70} |>
  round() |>
  hist(main = "Beta version", xlab = "")

[Histogram of the simulated scores, titled "Beta version"]
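As a quick sanity check of those numbers (purely illustrative), the theoretical mean of a Beta(4, 2) is shape1 / (shape1 + shape2) = 4/6 = 2/3, which scales to about 47 on the 0-70 range, and a simulated sample lands close to that:

(4 / (4 + 2)) * 70
# [1] 46.66667

mean(round(rbeta(1000, shape1 = 4, shape2 = 2) * 70))
# roughly 46-47, varying a little from run to run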

Upvotes: 5
