80gm2

Reputation: 202

Using rnorm() to generate data sets

I need to generate a data set containing 20 observations in each of 3 classes (60 observations in total) with 50 variables. I tried the code below, but it gives a warning and I end up with only 2 observations of 50 variables.

data = matrix(rnorm(20*3), ncol = 50)
Warning message:
In matrix(rnorm(20 * 3), ncol = 50) :
data length [60] is not a sub-multiple or multiple of the number of columns [50]

I would like to know where I am going wrong, or even if this is the best way to generate a data set, and some explanations of possible solutions so I can better understand how to do this in the future.

Upvotes: 3

Views: 5126

Answers (1)

LyzandeR

Reputation: 37879

The code below can probably be written in fewer than my 3 lines, but I want to keep it simple, and I also want to use the matrix function, which you seem to be familiar with:

#for the response variable y (60 values - 3 classes 1,2,3  - 20 observations per class)
y <- rep(c(1,2,3),20 ) #could use sample instead if you want this to be random as in docendo's answer

#for the matrix of variables x
#you need a matrix of 50 variables i.e. 50 columns and 60 rows i.e. 60x50 dimensions (=3000 table cells)
x <- matrix( rnorm(3000), ncol=50 )

#bind the 2 - y will be the first column 
mymatrix <- cbind(y,x)

> dim(x) #60 rows , 50 columns
[1] 60 50
> dim(mymatrix) #60 rows, 51 columns after the addition of the y variable
[1] 60 51
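As a quick sanity check on the result above, the following sketch rebuilds the same objects and confirms that each class ends up with 20 observations. The `set.seed(42)` call is an assumption added only so the random draws are reproducible, and `class_counts` and `y_blocks` are illustrative names. The last line shows `rep(..., each = 20)` as an alternative that keeps each class in a contiguous block instead of interleaving the labels.

```r
set.seed(42)  # arbitrary seed (an assumption) so the random draws are reproducible

# Class labels: 1, 2, 3 repeated 20 times, as in the answer above
y <- rep(c(1, 2, 3), 20)

# 60 x 50 matrix of standard normal draws (3000 values in total)
x <- matrix(rnorm(60 * 50), ncol = 50)

# Bind the labels as the first column; cbind names that column "y"
mymatrix <- cbind(y, x)

# Count observations per class; each class should appear 20 times
class_counts <- table(mymatrix[, "y"])
print(class_counts)

# Alternative labelling: 20 ones, then 20 twos, then 20 threes
y_blocks <- rep(c(1, 2, 3), each = 20)
```

Whether interleaved or blocked labels are preferable depends on what you do with the data afterwards; because every row of x is an independent random draw, the two are statistically equivalent here.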

Update

I just wanted to be a bit more specific about the warning you get when you call matrix in your question.

  1. First of all rnorm(20*3) is identical to rnorm(60) and it will produce a vector of 60 values from the standard normal distribution.
  2. When you use matrix, it fills the matrix with values column-wise unless you specify otherwise with the byrow argument. As the documentation puts it:

If one of nrow or ncol is not given, an attempt is made to infer it from the length of data and the other parameter. If neither is given, a one-column matrix is returned.

The logical way to infer it is from the equation n * m = number_of_elements_in_matrix, where n and m are the number of rows and columns of the matrix respectively. In your case number_of_elements_in_matrix was 60 and the number of columns was 50, so the number of rows would have to be 60/50 = 1.2. A fractional number of rows doesn't make sense, so R rounds up to 2 rows, recycles the 60 values to fill the 100 cells, and issues the warning you saw (it is a warning rather than an error, which is why you still got a 2 x 50 matrix back). Since you chose 50 columns, only a data length that is a multiple of 50 gives a whole number of rows; for 60 rows you need 60 * 50 = 3000 values. Hope that's clear!
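The behaviour described above can be reproduced with a small sketch. It uses integer sequences instead of rnorm() draws so the fill order is easy to see, and all variable names are illustrative only. The middle part captures the warning with tryCatch() and then checks that the inferred row count equals ceiling(60 / 50), with the data recycled to fill the remaining cells.

```r
# Column-wise fill (the default): nrow is inferred as 6 / 2 = 3
m_col <- matrix(1:6, ncol = 2)

# Row-wise fill via byrow = TRUE
m_row <- matrix(1:6, ncol = 2, byrow = TRUE)

# The call from the question: 60 values do not divide evenly into 50 columns.
# Record whether a warning is raised.
got_warning <- tryCatch({
  matrix(1:60, ncol = 50)
  FALSE
}, warning = function(w) TRUE)

# Build the same matrix again with the warning suppressed so it can be inspected
bad <- suppressWarnings(matrix(1:60, ncol = 50))

# The inferred row count is 60 / 50 rounded up
inferred_rows <- ceiling(60 / 50)

# With 60 * 50 = 3000 values the division is exact and no warning appears
good <- matrix(rnorm(60 * 50), ncol = 50)

print(got_warning)
print(dim(bad))
print(dim(good))
```

Passing nrow = 60 explicitly alongside ncol = 50 is another way to make the intended shape obvious to anyone reading the code.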

Upvotes: 3
