Reputation: 3
I'm having trouble making synthetic data for some clustering tasks.
I want to create 200 data vectors x of 20 dimensions.
To generate them, I first prepared cluster centers m of 5 dimensions and a linear transformation A of shape (20, 5), where all elements are drawn independently from N(0, 1).
The data vectors are then generated by x_i ~ N(Am_k, I), where I is the identity matrix of shape (20, 20).
Here is the code I wrote:
import numpy as np
A = np.random.normal(0, 1, (20, 5))  # linear transformation
m1 = np.random.normal(0, 1, (5,))  # center of cluster 1
data1 = []  # collects the samples drawn for cluster 1
for _ in range(50):
    data1.append(np.random.normal(np.matmul(A, m1), np.identity(20)))
x1 = np.vstack(data1)  # x1.shape -> (1000, 20)
In this code, I tried to generate 50 data vectors from the cluster center m1.
The shape of x1 should be (50, 20), but the output shape is (1000, 20).
So I changed the scale argument from np.identity(20) to 1 and got data with shape (50, 20).
But I'm not sure whether this change still reproduces the data vectors I intended.
My questions are: do np.random.normal(np.matmul(A, m1), np.identity(20)) and np.random.normal(np.matmul(A, m1), 1) produce the same data?
And how should I implement x_i ~ N(Am_k, I)?
Edit: the number of classes is k = 4, and the class of x_i is assigned from 1 to 4, one class per block of 50 data vectors.
Upvotes: 0
Views: 89
Reputation: 902
In this code, I tried to generate 50 data vectors from the cluster center m1. The shape of x1 should be (50, 20), but the output shape is (1000, 20).
So the problem here is that you're using np.random.normal instead of np.random.multivariate_normal, so you should really have
for _ in range(50):
    data1.append(np.random.multivariate_normal(np.matmul(A, m1), np.identity(20)))
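The loop can also be avoided, since multivariate_normal accepts a size argument. A minimal sketch, assuming NumPy's newer Generator API; the name rng and the seed value are illustrative choices, not part of the original code:

```python
import numpy as np

# Seeded generator; the seed is arbitrary and only makes reruns repeatable.
rng = np.random.default_rng(0)

A = rng.normal(0, 1, (20, 5))   # linear transformation, shape (20, 5)
m1 = rng.normal(0, 1, (5,))     # center of cluster 1, shape (5,)

mean = A @ m1                   # mean of the cluster in data space, shape (20,)

# size=50 draws 50 independent vectors from N(mean, I) in one call.
x1 = rng.multivariate_normal(mean, np.identity(20), size=50)
print(x1.shape)  # (50, 20)
```

Each row of x1 is one sample, so no np.vstack is needed afterwards.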
The questions are, do np.random.normal(np.matmul(A, m1), np.identity(20)) and np.random.normal(np.matmul(A, m1), 1) produce the same data? And how should I implement x_i ~ N(Am_k, I)?
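No, they wouldn't produce the same data. The shapes they output aren't the same, so they can't.
In the first one, np.random.normal treats scale as an elementwise standard deviation, not a covariance matrix. The mean of shape (20,) is broadcast against the scale of shape (20, 20), so each call returns a (20, 20) array in which entry (i, j) has the j-th mean component as its mean and I[i, j] as its standard deviation. Only the diagonal entries get any noise; every off-diagonal entry is exactly the mean. Stacking 50 of these gives (1000, 20).
The second one generates what you want: it draws each coordinate with variance one, independently of the other dimensions. But if you want to draw explicitly from a multivariate Gaussian in the future, you can get x_i ~ N(Am_k, I) by using np.random.multivariate_normal.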
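To check the shapes yourself and build the whole dataset from the question's edit (4 classes, 50 vectors each), something like the following should work. It is only a sketch: the names rng, centers, labels and X, and the seed, are assumptions made for illustration, and it relies on the fact that N(mean, I) is the same as adding independent unit-variance noise to each coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)        # arbitrary seed, for repeatable runs only

K, n_per, d_latent, d_obs = 4, 50, 5, 20
A = rng.normal(0, 1, (d_obs, d_latent))      # linear transformation
centers = rng.normal(0, 1, (K, d_latent))    # cluster centers m_1 .. m_4

# Matrix-valued scale: loc (20,) broadcasts against scale (20, 20).
mean1 = A @ centers[0]
weird = rng.normal(mean1, np.identity(d_obs))
print(weird.shape)                    # (20, 20)

# Off-diagonal entries have standard deviation 0, so they equal the mean.
off_diag = ~np.eye(d_obs, dtype=bool)
expected = np.broadcast_to(mean1, (d_obs, d_obs))
print(np.allclose(weird[off_diag], expected[off_diag]))  # True

# Full dataset: class labels 1..4, one class per block of 50 vectors.
labels = np.repeat(np.arange(1, K + 1), n_per)           # shape (200,)
means = centers[labels - 1] @ A.T                        # shape (200, 20)
X = means + rng.standard_normal(means.shape)             # x_i ~ N(A m_k, I)
print(X.shape)                        # (200, 20)
```

For a non-identity covariance this shortcut no longer applies, and multivariate_normal (or a Cholesky factor of the covariance) would be needed instead.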
Upvotes: 1