James L.

Reputation: 14515

The supplied model is not a clustering estimator in YellowBrick

I am trying to visualize an elbow plot for my data using YellowBrick's KElbowVisualizer and SKLearn's Expectation Maximization algorithm class: GaussianMixture.

When I run this, I get the error in the title. (I have also tried ClassificationReport, but that fails as well)

from sklearn.mixture import GaussianMixture
from yellowbrick.cluster import KElbowVisualizer

model = GaussianMixture()

data = get_data(data_name, preprocessor_name, train_split=0.75)
X, y, x_test, y_test = data

visualizer = KElbowVisualizer(model, k=(4, 12))
visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure

I cannot find anything in YellowBrick to help me estimate the number of components for expectation maximization.

Upvotes: 2

Views: 2361

Answers (3)

Chris Vandevelde

Reputation: 1441

Building on @bbengfort's great answer, I used:

class GaussianMixtureCluster(GaussianMixture, ClusterMixin):
    """Subclass of GaussianMixture to make it a ClusterMixin."""

    def fit(self, X):
        super().fit(X)
        self.labels_ = self.predict(X)
        return self

    def get_params(self, **kwargs):
        output = super().get_params(**kwargs)
        output["n_clusters"] = output.get("n_components", None)
        return output

    def set_params(self, **kwargs):
        # Only remap when n_clusters is actually passed; otherwise
        # popping with a default would set n_components=None
        if "n_clusters" in kwargs:
            kwargs["n_components"] = kwargs.pop("n_clusters")
        return super().set_params(**kwargs)

This lets you use any scoring metric, and works with the latest version of YellowBrick.
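For reference, here is a standalone sanity check (scikit-learn only, no Yellowbrick) that the subclass really exposes the clustering API the elbow visualizer relies on. The class is repeated so the snippet is self-contained, and the `make_blobs` data and `n_clusters=3` value are just illustrative:

```python
from sklearn.base import ClusterMixin
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture


class GaussianMixtureCluster(GaussianMixture, ClusterMixin):
    """Subclass of GaussianMixture to make it a ClusterMixin."""

    def fit(self, X):
        super().fit(X)
        self.labels_ = self.predict(X)
        return self

    def get_params(self, **kwargs):
        output = super().get_params(**kwargs)
        output["n_clusters"] = output.get("n_components", None)
        return output

    def set_params(self, **kwargs):
        if "n_clusters" in kwargs:
            kwargs["n_components"] = kwargs.pop("n_clusters")
        return super().set_params(**kwargs)


X, _ = make_blobs(n_samples=200, centers=3, random_state=42)

model = GaussianMixtureCluster().set_params(n_clusters=3).fit(X)
print("n_clusters" in model.get_params())  # True: KElbow can vary k
print(len(model.labels_) == len(X))        # True: labels_ learned on fit
```

Because `set_params`/`get_params` translate `n_clusters` to `n_components`, KElbow can sweep k exactly as it would for KMeans.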

Upvotes: 3

bbengfort

Reputation: 5392

Yellowbrick uses the sklearn estimator type checks to determine if a model is well suited to the visualization. You can use the force_model param to bypass the type checking (though it seems that the KElbow documentation needs to be updated with this).

However, even though force_model=True gets you through the YellowbrickTypeError it still does not mean that GaussianMixture works with KElbow. This is because the elbow visualizer is set up to work with the centroidal clustering API and requires both a n_clusters hyperparam and a labels_ learned param. Expectation maximization models do not support this API.
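To see the mismatch concretely, a quick comparison of the two estimators' parameters (scikit-learn only) shows why the elbow visualizer cannot vary k on a stock GaussianMixture:

```python
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# KElbow varies an `n_clusters` hyperparameter and reads a fitted
# `labels_` attribute. KMeans exposes both; GaussianMixture calls the
# same knob `n_components` and never sets `labels_`.
print("n_clusters" in KMeans().get_params())           # True
print("n_clusters" in GaussianMixture().get_params())  # False
```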

However, it is possible to create a wrapper around the Gaussian mixture model that will allow it to work with the elbow visualizer (and a similar method could be used with the classification report as well).

from sklearn.base import ClusterMixin
from sklearn.mixture import GaussianMixture
from yellowbrick.cluster import KElbow
from yellowbrick.datasets import load_nfl

class GMClusters(GaussianMixture, ClusterMixin):

    def __init__(self, n_clusters=1, **kwargs):
        kwargs["n_components"] = n_clusters
        super(GMClusters, self).__init__(**kwargs)

    def fit(self, X):
        super(GMClusters, self).fit(X)
        self.labels_ = self.predict(X)
        return self 


X, _ = load_nfl()
oz = KElbow(GMClusters(), k=(4,12), force_model=True)
oz.fit(X)
oz.show()

This does produce a KElbow plot (though not a great one for this particular dataset):

KElbow with distortion score

Another answer mentioned Calinski-Harabasz scores, which you can use in the KElbow visualizer as follows:

oz = KElbow(GMClusters(), k=(4,12), metric='calinski_harabasz', force_model=True)
oz.fit(X)
oz.show()

Creating the wrapper isn't ideal, but for model types that don't fit the standard classifier or clusterer sklearn APIs, they are often necessary and it's a good strategy to have in your back pocket for a number of ML tasks.

Upvotes: 13

Michael Bridges

Reputation: 391

You can use sklearn's calinski_harabasz_score; see the relevant docs here.

import matplotlib.pyplot as plt
import pandas as pd
from sklearn.metrics import calinski_harabasz_score
from sklearn.mixture import GaussianMixture

scores = pd.DataFrame()
components = 100
for n in range(2, components):
    model = GaussianMixture(n_components=n)
    y = model.fit_predict(X)
    scores.loc[n, 'score'] = calinski_harabasz_score(X, y)
plt.plot(scores.reset_index()['index'], scores['score'])

Something like this should give similar functionality.

Upvotes: 2
