sand

Reputation: 183

Spectral Clustering Scikit learn print items in Cluster

I know I can get the contents of a particular cluster in K-means clustering with the following code using scikit-learn.

    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    terms = vectorizer.get_feature_names()
    for i in range(true_k):
        print "Cluster %d:" % i,
        for ind in order_centroids[i, :10]:
            print ' %s' % terms[ind],
        print

How do I do the same for spectral clustering, given that there is no 'cluster_centers_' attribute for spectral clustering? I am trying to cluster terms in text documents.

Upvotes: 3

Views: 6245

Answers (3)

Eric A. Bunch

Reputation: 121

While it's true that you can't get the cluster centers for spectral clustering, you can do something close that might be useful in some cases. To explain, I'll run through the spectral clustering algorithm quickly and explain the modification.

First, let's call our dataset X = {x_1, ..., x_N}, where each point is d-dimensional (d is the number of features in your dataset). We can think of X as an N by d matrix. Let's say we want to put this data into k clusters. Spectral clustering first transforms the dataset into another representation and then uses K-means clustering on that new representation to obtain the clusters.

First, the affinity matrix A is formed using nearest-neighbors information. For this, we choose a positive integer n to construct A. The element A_{i, j} is equal to 1 if x_i and x_j are each among the top n nearest neighbors of the other, and A_{i, j} is equal to 0 otherwise; A is a symmetric N by N matrix. Next, we construct the normalized Laplacian matrix L of A, which is L = I - D^{-1/2}AD^{-1/2}, where D is the degree matrix. Then the eigenvalue decomposition L = VEV^{-1} is computed, where V is the matrix of eigenvectors of L, and E is the diagonal matrix with the eigenvalues of L on the diagonal. Since L is positive semi-definite, its eigenvalues are all non-negative real numbers. For spectral clustering, we use this to order the columns of V so that the first column of V corresponds to the smallest eigenvalue of L, and the last column to the largest eigenvalue of L.
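If you want to see that construction in isolation (below I'll instead cheat and pull the affinity matrix out of scikit-learn's SpectralClustering), here is a minimal sketch of building a mutual nearest-neighbors affinity matrix and its normalized Laplacian; the helper name, the n_neighbors value, and the choice of mutual-kNN symmetrization are my assumptions, not part of scikit-learn's exact recipe:

import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import laplacian

def knn_affinity_and_laplacian(X, n=10):
    # directed k-nearest-neighbors connectivity graph (sparse, 0/1 entries)
    knn = kneighbors_graph(X, n_neighbors=n, mode='connectivity', include_self=False)
    # keep an edge only when both points list each other as neighbors (mutual kNN)
    A = knn.minimum(knn.T)
    # normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}
    L = laplacian(A, normed=True)
    return A, L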

Next, we take the first k columns of V and view them as N points in k-dimensional space. Let's write this truncated matrix as V', and write its rows as {v'_1, ..., v'_N}, where each v'_i is k-dimensional. Then we use the K-means algorithm to cluster these points into k clusters {C'_1, ..., C'_k}. The clusters are then assigned to the points in the dataset X by "pulling back" the clusters from V' to X: the point x_i is in cluster C_j if and only if v'_i is in cluster C'_j.

Now, one of the main points of transforming X into V' and clustering on that representation is that X is often not spherically distributed, while V' at least comes closer to being so. Since V' is closer to being spherically distributed, each centroid will be "inside" the cluster of points it defines. Let's call the cluster centroids {c_1, ..., c_k}; these are points in the space that V' lives in. For each cluster, choose the point of V' that is closest to that cluster's centroid, giving k representative points {v'_i_1, ..., v'_i_k} of V'. Then choose the corresponding points {x_i_1, ..., x_i_k} as the cluster representatives for the clusters of X.

This method might not always work the way you want, but it at least gets you closer to what you're after, and maybe you can modify it to fit your needs. Here's some example code to show how to do this.

Let's use some fake data provided by scikit-learn.

import numpy as np
import pandas as pd
from sklearn.datasets import make_moons

moons_data = make_moons(n_samples=1000, noise=0.07, random_state=0)
moons = pd.DataFrame(data=moons_data[0], columns=['x', 'y'])
moons['label_truth'] = moons_data[1]

moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(8, 8),
    s=10,
    alpha=0.7
);

[Plot: the noisy moons dataset]

I'm going to kind of cheat and use the spectral clustering method provided by scikit-learn, and then extract the affinity matrix from there.

from sklearn.cluster import SpectralClustering

sclust = SpectralClustering(
    n_clusters=2,
    random_state=42,
    affinity='nearest_neighbors',
    n_neighbors=10,
    assign_labels='kmeans'
)

sclust.fit(moons[['x', 'y']]);

moons['label_cluster'] = sclust.labels_

moons.plot(
    kind='scatter',
    x='x',
    y='y',
    figsize=(16, 14),
    s=10,
    alpha=0.7,
    c='label_cluster',
    cmap='Spectral'
);

[Plot: the noisy moons dataset colored by spectral clustering label]

Next, we'll compute the normalized Laplacian of the affinity matrix, and instead of computing the whole eigenvalue decomposition of the Laplacian, we'll use the scipy function eigsh to extract just the two eigenvectors (since we want two clusters) corresponding to the two smallest eigenvalues.

from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

# kNN affinity matrix computed by SpectralClustering
affinity_matrix = sclust.affinity_matrix_
# normalized graph Laplacian
lpn = laplacian(affinity_matrix, normed=True)
# eigenvectors for the two smallest eigenvalues ('SM' = smallest magnitude)
w, v = eigsh(lpn, k=2, which='SM')

pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    figsize=(16, 16),
    s=10,
    alpha=0.7
);

[Plot: the data in the two-eigenvector representation]

Then let's use K-means to cluster on this new representation of the data. Let's also find the two points in this new representation that are closest to the cluster centroids, and highlight them.

from sklearn.cluster import KMeans
from scipy.spatial.distance import euclidean
import matplotlib.pyplot as plt

# cluster the eigenvector representation into two clusters
kmeans = KMeans(
    n_clusters=2,
    random_state=42
)
kmeans.fit(v)
center_0, center_1 = kmeans.cluster_centers_
# index of the embedded point closest to each cluster centroid
representative_index_0 = np.argmin(np.array([euclidean(a, center_0) for a in v]))
representative_index_1 = np.argmin(np.array([euclidean(a, center_1) for a in v]))
fig, ax = plt.subplots(figsize=(16, 16));

pd.DataFrame(v).plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=10,
    alpha=0.7);

pd.DataFrame(v).iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x=0,
    y=1,
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
)

[Plot: eigenvector representation with the points closest to the cluster centroids highlighted]

And finally, let's plot the original dataset with the corresponding points highlighted.

moons['labels_lpn_kmeans'] = kmeans.labels_

fig, ax = plt.subplots(figsize=(16, 14));

moons.plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=10,
    alpha=0.7,
    c='labels_lpn_kmeans',
    cmap='Spectral'
);

moons.iloc[[representative_index_0, representative_index_1]].plot(
    kind='scatter',
    x='x',
    y='y',
    ax=ax,
    s=100,
    alpha=0.9,
    c='orange',
);

[Plot: the noisy moons with the centroid-representative points highlighted]

As we can see, the highlighted points are maybe not where we might expect them to be, but this at least gives a way of algorithmically choosing a representative point from each cluster.

Upvotes: 2

Athanasios Salamanis

Reputation: 346

Spectral clustering does not compute any centroids. In a more practical context, if you really need a kind of 'centroid' derived from the spectral clustering algorithm, you can always compute the average (mean) of the points belonging to the same cluster, after the clustering process has finished. These would be an approximation of the centroids defined in the context of the typical k-means algorithm. The same principle also applies to other clustering algorithms that do not produce centroids (e.g. hierarchical).
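A minimal sketch of that idea, assuming X is the (n_samples, n_features) array being clustered (the name X and the clustering parameters here are just placeholders):

import numpy as np
from sklearn.cluster import SpectralClustering

# fit spectral clustering on the data; labels_ holds each point's cluster id
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors').fit(X)

# mean of the points assigned to each cluster, as a stand-in for 'centroids'
pseudo_centroids = np.array([
    X[model.labels_ == label].mean(axis=0)
    for label in np.unique(model.labels_)
])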

Upvotes: 1

Ibraim Ganiev

Reputation: 9390

UPDATED: Sorry, I didn't understand your question correctly the first time.

I think it's impossible to do what you want with spectral clustering, because the spectral clustering method by itself doesn't compute any centers; it doesn't need them at all. It doesn't even operate on the sample points in the raw space: spectral clustering transforms your dataset into a different subspace and then tries to cluster the points there. And I don't know how to invert this transformation mathematically.

A Tutorial on Spectral Clustering

Maybe you should ask your question, since it's more theoretical, on the math-related communities of SO.

from sklearn import cluster

spectral = cluster.SpectralClustering(n_clusters=2, eigen_solver='arpack', affinity="nearest_neighbors")
spectral.fit(X)
y_pred = spectral.labels_.astype(int)

From here
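If what you actually need is to print which items ended up in which cluster (rather than any cluster centers), the labels_ array is enough. A rough sketch, assuming terms is the list of items that was clustered, in the same order as the rows passed to fit (terms is an assumption carried over from the question):

from collections import defaultdict

# group items by their assigned cluster label
clusters = defaultdict(list)
for term, label in zip(terms, spectral.labels_):
    clusters[label].append(term)

for label, members in sorted(clusters.items()):
    print("Cluster %d: %s" % (label, ', '.join(members)))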

Upvotes: 3
