Reputation: 11
First let me clarify that here "sparse PCA" means PCA with L1 penalty and sparse loadings, not PCA on sparse matrix.
I've read the paper on sparse PCA by Zou and Hastie, I've read the documentation on sklearn.decomposition.SparsePCA, and I know how to use PCA, but I can't seem to get the right result from SparsePCA.
Namely, when L1 penalty is 0, the result from SparsePCA is supposed to agree with PCA, but the loadings differ quite a lot. To make sure that I didn't mess up any hyperparameters, I used the same hyperparameters (convergence tolerance, maximum iterations, ridge penalty, lasso penalty...) in R with 'spca' from 'elasticnet', and R gave me the correct result. I'd rather not have to go through the source code of SparsePCA if anyone has experience using this function and could let me know if I made any mistakes.
Below is how I generated my dataset. It's a bit convoluted because I wanted a specific Markov Decision Process to test some reinforcement learning algorithms. Just treat it as some non-sparse dataset.
import numpy as np
from sklearn.decomposition import PCA, SparsePCA
import numpy.random as nr
def transform(data, TranType=None):
if TranType == 'quad':
data = np.minimum(np.square(data), 3)
if TranType == 'cubic':
data = np.maximum(np.minimum(np.power(data, 3), 3), -3)
if TranType == 'exp':
data = np.minimum(np.exp(data), 3)
if TranType == 'abslog':
data = np.minimum(np.log(abs(data)), 3)
return data
def NewStateGen(OldS, A, TranType, m=0, sd=0.5, nsd=0.1, dim=64):
# dim needs to be a multiple of 4, and preferably a multiple of 16.
assert (dim == len(OldS) and dim % 4 == 0)
TrueDim = dim / 4
NewS = np.zeros(dim)
# Generate new state according to action
if A == 0:
NewS[range(0, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
nr.normal(scale=nsd, size=TrueDim)
NewS[range(1, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
nr.normal(scale=nsd, size=TrueDim)
NewS[range(2, dim, 4)] = nr.normal(m, sd, size=TrueDim)
NewS[range(3, dim, 4)] = nr.normal(m, sd, size=TrueDim)
R = 2 * np.sum(transform(OldS[0:int(np.ceil(dim / 32.0))], TranType)) - \
np.sum(transform(OldS[int(np.ceil(dim / 32.0)):(dim / 16)], TranType)) + \
nr.normal(scale=nsd)
if A == 1:
NewS[range(0, dim, 4)] = nr.normal(m, sd, size=TrueDim)
NewS[range(1, dim, 4)] = nr.normal(m, sd, size=TrueDim)
NewS[range(2, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
nr.normal(scale=nsd, size=TrueDim)
NewS[range(3, dim, 4)] = transform(OldS[0:TrueDim], TranType) + \
nr.normal(scale=nsd, size=TrueDim)
R = 2 * np.sum(transform(OldS[int(np.floor(dim / 32.0)):(dim / 16)], TranType)) - \
np.sum(transform(OldS[0:int(np.floor(dim / 32.0))], TranType)) + \
nr.normal(scale=nsd)
return NewS, R
def MDPGen(dim=64, rep=1, n=30, T=100, m=0, sd=0.5, nsd=0.1, TranType=None):
X_all = np.zeros(shape=(rep*n*T, dim))
Y_all = np.zeros(shape=(rep*n*T, dim+1))
A_all = np.zeros(rep*n*T)
R_all = np.zeros(rep*n*T)
for j in xrange(rep*n):
# Data for a single subject
X = np.zeros(shape=(T+1, dim))
A = np.zeros(T)
R = np.zeros(T)
NewS = np.zeros(dim)
X[0] = nr.normal(m, sd, size=dim)
for i in xrange(T):
OldS = X[i]
# Pick a random action
A[i] = nr.randint(2)
# Generate new state according to action
X[i+1], R[i] = NewStateGen(OldS, A[i], TranType, m, sd, nsd, dim)
Y = np.concatenate((X[1:(T+1)], R.reshape(T, 1)), axis=1)
X = X[0:T]
X_all[(j*T):((j+1)*T)] = X
Y_all[(j*T):((j+1)*T)] = Y
A_all[(j*T):((j+1)*T)] = A
R_all[(j*T):((j+1)*T)] = R
return {'X': X_all, 'Y': Y_all, 'A': A_all, 'R': R_all, 'rep': rep, 'n': n, 'T': T}
nr.seed(1)
MDP = MDPGen(dim=64, rep=1, n=30, T=90, sd=0.5, nsd=0.1, TranType=None)
X = MDP.get('X').astype(np.float32)
Now I run PCA and SparsePCA. When the lasso penalty, 'alpha', is 0, SparsePCA is supposed to give the same result as PCA, which is not the case. The other hyperparameters are set with the default values from elasticnet in R. If I use the default from SparsePCA the result will still be incorrect.
PCA_model = PCA(n_components=64)
PCA_model.fit(X)
Z = PCA_model.transform(X)
SPCA_model = SparsePCA(n_components=64, alpha=0, ridge_alpha=1e-6, max_iter=200, tol=1e-3)
SPCA_model.fit(X)
SZ = SPCA_model.transform(X)
# Check the first 2 loadings from PCA and SPCA. They are supposed to agree.
print PCA_model.components_[0:2]
print SPCA_model.components_[0:2]
# Check the first 2 observations of transformed data. They are supposed to agree.
print Z[0:2]
print SZ[0:2]
When the lasso penalty is greater than 0, the result from SparsePCA is still quite different from what R gives me, and the latter is correct based on manual inspection and what I learned from the original paper. So, is SparsePCA broken, or did I miss anything?
Upvotes: 1
Views: 1496
Reputation: 1996
The answers aren't different. First, I thought it may be the solvers, but checking for different solvers, I get almost identical loadings. See this:
nr.seed(1)
MDP = MDPGen(dim=16, rep=1, n=30, T=90, sd=0.5, nsd=0.1, TranType=None)
X = MDP.get('X').astype(np.float32)
PCA_model = PCA(n_components=10,svd_solver='auto',tol=1e-6)
PCA_model.fit(X)
SPCA_model = SparsePCA(n_components=10, alpha=0, ridge_alpha=0)
SPCA_model.fit(X)
PC1 = PCA_model.components_[0]/np.linalg.norm(PCA_model.components_[0])
SPC1 = SPCA_model.components_[0].T/np.linalg.norm(SPCA_model.components_[0])
print(np.dot(PC1,SPC1))
import pylab
pylab.plot(PC1)
pylab.plot(SPC1)
pylab.show()
Upvotes: 0
Reputation: 33532
As often: there are many different formulations & implementations.
sklearn is using a different implementation with different characteristics.
Let's have a look how they differ:
So it seems sklearn is at least doing something different in regards to the l2-norm based component (it's missing).
This is by design as this is the basic form within the area of dictionary-learning: (algorithm-paper linked by sklearn used for implementation).
It is quite possible, that this alternative formulation is not guaranteeing (or does not care at all) to emulate classic PCA when the sparsity-parameter is zero (which is not really surprising as these problems differ a lot in regards to optimization-theory and sparsePCA has to reside to some heuristic-based algorithm as the problem itself is NP-hard, ref). This idea is strengthened by the describing of the equivalence theorem here:
Upvotes: 3