Guido

Reputation: 6772

Spectral clustering on sparse dataset

I am applying spectral clustering (sklearn.cluster.SpectralClustering) to a dataset with a fairly large number of features that is relatively sparse. When doing spectral clustering in Python, I get the following warning:

UserWarning: Graph is not fully connected, spectral embedding may not work as expected. warnings.warn("Graph is not fully connected, spectral embedding"

This is often followed by an error like this one:

```
File "****.py", line 120, in perform_clustering_spectral_clustering
  predicted_clusters = cluster.SpectralClustering(n_clusters=n).fit_predict(features)
File "****\sklearn\base.py", line 349, in fit_predict
  self.fit(X)
File "****\sklearn\cluster\spectral.py", line 450, in fit
  assign_labels=self.assign_labels)
File "****\sklearn\cluster\spectral.py", line 256, in spectral_clustering
  eigen_tol=eigen_tol, drop_first=False)
File "****\sklearn\manifold\spectral_embedding_.py", line 297, in spectral_embedding
  largest=False, maxiter=2000)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 462, in lobpcg
  activeBlockVectorBP, retInvR=True)
File "****\scipy\sparse\linalg\eigen\lobpcg\lobpcg.py", line 112, in _b_orthonormalize
  gramVBV = cholesky(gramVBV)
File "****\scipy\linalg\decomp_cholesky.py", line 81, in cholesky
  check_finite=check_finite)
File "****\scipy\linalg\decomp_cholesky.py", line 30, in _cholesky
  raise LinAlgError("%d-th leading minor not positive definite" % info)
numpy.linalg.linalg.LinAlgError: 9-th leading minor not positive definite
numpy.linalg.linalg.LinAlgError: the leading minor of order 12 of 'b' is not positive definite. The factorization of 'b' could not be completed and no eigenvalues or eigenvectors were computed.
```

However, this warning/error does not always occur with the same settings (its behaviour is inconsistent, which makes it hard to test). It occurs for different values of n_clusters, but in my (admittedly brief) experience it happens most often for n = 2 and n > 7.

How should I cope with this warning and the related error? Does it depend on the number of features? What happens if I add more?

Upvotes: 2

Views: 2997

Answers (1)

moron

Reputation: 69

I also encountered this problem with n_clusters. Since this is unsupervised ML, there is no single correct value for n_clusters; in your case it seems to lie between 3 and 7. Assuming you have some ground truth for the clustering, the best way to handle this would be to try a few values of n_clusters and see whether any pattern emerges for the given dataset, while making sure to avoid over-fitting. You may also use the silhouette coefficient (sklearn.metrics.silhouette_score).
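A minimal sketch of scanning n_clusters with the silhouette coefficient (the toy data from make_blobs is just a stand-in for your own feature matrix):

```python
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data standing in for the real features (assumption: dense numeric matrix)
features, _ = make_blobs(n_samples=200, centers=4, random_state=0)

scores = {}
for n in range(2, 8):
    labels = SpectralClustering(n_clusters=n, random_state=0).fit_predict(features)
    scores[n] = silhouette_score(features, labels)  # in [-1, 1], higher is better

# The n with the highest silhouette is a reasonable candidate for n_clusters
best_n = max(scores, key=scores.get)
```

This only suggests a candidate; you would still want to check it against whatever ground truth or domain knowledge you have.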

Upvotes: 1
