user529295
user529295

Reputation: 189

Generating linearly independent columns for a matrix

As the title suggests, I want to generate a random N x d matrix (N - number of examples, d - number of features) where each column is linearly independent of the other columns. How can I implement the same using numpy and python?

Upvotes: 4

Views: 1935

Answers (3)

StupidWolf
StupidWolf

Reputation: 46958

You can still have a small degree of correlation, simply by chance, if your number of observations is small.

One way of ensuring that, is to using the principal component scores. So brief explanation from wiki:

Repeating this process yields an orthogonal basis in which different individual dimensions of the data are uncorrelated. These basis vectors are called principal components, and several related procedures principal component analysis (PCA).

We can see this below:

from sklearn.decomposition import PCA
import numpy as np
import seaborn as sns

N = 50
d = 20

a = np.random.normal(0,1,(50,20))
pca = PCA(n_components=d)
pca.fit(a)
pc_scores = pca.transform(a)

fig, ax = plt.subplots(1, 2,figsize=(10,4))
sns.heatmap(np.corrcoef(np.transpose(a)),ax=ax[0],cmap="YlGnBu")
sns.heatmap(np.corrcoef(np.transpose(pc_scores)),ax=ax[1],cmap="YlGnBu")

enter image description here

The heatmap on the matrix shows you can still have some degree of correlation by chance (drawing from a standard normal, but small sample size).

Upvotes: 0

FBruzzesi
FBruzzesi

Reputation: 6495

One way to guarantee random independent columns would be to iteratively add a random column and check matrix rank:

import numpy as np

N, d = 1000, 200
M = np.random.rand(N,1)
r = 1 #matrix rank

while r < d:
    t = np.random.rand(N,1)

    if np.linalg.matrix_rank(np.hstack([M,t])) > r:
        M = np.hstack([M,t])
        r+=1

However this process is quite slow since requires to compute the rank of a matrix at least d times.

A faster approach would be to generate a random Nxd 2d-array and check its rank:

M = np.random.rand(N,d)
r = np.linalg.matrix_rank(M)

while r < d:
    M = np.random.rand(N,d)
    r = np.linalg.matrix_rank(M) 

Which is likely to never enter the while loop, however we add a check and eventually generate another random 2d-array.

Upvotes: 1

Trevor Galivan
Trevor Galivan

Reputation: 383

If you just generate the vectors at random, the chance that the column vectors will not be linearly independent is very very small (Assuming N >= d).

Let A = [B | x] where A is a N x d matrix, B is an N x (d-1) matrix with independent column vectors, and x is a column vector with N elements. The set of all x with no constraints is a subspace with dimension N, while the set of all x such that x is NOT linearly independent with all column vectors in B would be a subspace with dimension d-1 (since every column vector in B serves as a basis vector for this set).

Since you are dealing with bounded, discrete numbers (likely doubles, floats, or integers), the probability of the matrix not being linearly independent will not be exactly zero. The more possible values each element can take, in general, the more likely the matrix is to have independent column vectors.

Therefore, I recommend you chose elements at random. You can always verify after the fact that the matrix has linearly independent column vectors by calculating it's column-echelon form. You could do this with np.random.rand(N,d).

Upvotes: 2

Related Questions