Reputation: 189
As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?
Upvotes: 4
Views: 1935
Reputation: 46958
You can still have a small degree of correlation, simply by chance, if your number of observations is small.
One way of ensuring that, is to using the principal component scores. So brief explanation from wiki:
Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and several related procedures principal component analysis (PCA).
We can see this below:
from sklearn.decomposition import PCA
import numpy as np
import seaborn as sns
N = 50
d = 20
a = np.random.normal(0,1,(50,20))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)
fig, ax = plt.subplots(1, 2,figsize=(10,4))
sns.heatmap(np.corrcoef(np.transpose(a)),ax=ax[0],cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)),ax=ax[1],cmap="YlGnBu")
The heatmap on the matrix shows you can still have some degree of correlation by chance (drawing from a standard normal, but small sample size).
Upvotes: 0
Reputation: 6495
One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:
import numpy as np
N, d = 1000, 200
M = np.random.rand(N,1)
r = 1 #matrix rank
while r < d:
t = np.random.rand(N,1)
if np.linalg.matrix_rank(np.hstack([M,t])) > r:
M = np.hstack([M,t])
r+=1
However this process is quite slow since requires to compute the rank of a matrix at least d
times.
A faster approach would be to generate a random Nxd
2d-array and check its rank:
M = np.random.rand(N,d)
r = np.linalg.matrix_rank(M)
while r < d:
M = np.random.rand(N,d)
r = np.linalg.matrix_rank(M)
Which is likely to never enter the while loop, however we add a check and eventually generate another random 2d-array.
Upvotes: 1
Reputation: 383
If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is very very small (Assuming N >= d).
Let A = [B | x] where A is a N x d matrix, B is an N x (d-1) matrix with independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a subspace with dimension N, while the set of all x such that x is NOT linearly independent with all column vectors in B would be a subspace with dimension d-1 (since every column vector in B serves as a basis vector for this set).
Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.
Therefore, I recommend you chose elements at random. You can always verify after the fact that the matrix has linearly independent column vectors by calculating it's column-echelon form. You could do this with np.random.rand(N,d).
Upvotes: 2