Reputation: 288
I ran the KMeans algorithm on my dataset and created a PairGrid that shows the points, with the hue given by cluster.
km = KMeans(init='random', n_clusters=N, random_state = RANDOM_STATE).fit(df_scaled)
df_numeric["cluster"] = km.labels_
g = sns.PairGrid(data=df_numeric, hue = "cluster", corner = True, palette = "viridis")
g.map_lower(sns.scatterplot, marker=".")
g.map_diag(sns.histplot, color = 0.1)
g.add_legend(frameon=True)
g.legend.set_bbox_to_anchor((.61, .6))
The result is something like this:
I would now like to add the centroid of each cluster to the plot, ideally with a distinct marker (a star, for example). Is there an easy way to do this in Seaborn? Keep in mind that I compute the clusters on the scaled dataframe (df_scaled) but want to plot the original dataframe (df_numeric), so the centroid values are not immediately usable: I would need to undo the scaling or label the centroids somehow.
Thank you in advance
Upvotes: 1
Views: 308
Reputation: 80279
You'll need to apply the inverse transformation to the cluster centers to bring them back to the scale of df_numeric. A custom function can then be mapped onto the grid to draw the centers. Here is an example using the iris dataset:
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.cluster import KMeans
def show_centers(x, y, centers, color, label):
    # PairGrid passes the hue level as `label`; here it doubles as the row index into `centers`
    x_col = df_numeric.columns.get_loc(x.name)
    if y is None:  # for the histograms
        plt.axvline(centers[label, x_col], color='r', ls=':')
    else:
        y_col = df_numeric.columns.get_loc(y.name)
        plt.scatter(centers[label, x_col], centers[label, y_col], marker='*', color='r', s=30)
df_numeric = sns.load_dataset('iris').drop(columns='species')
df_mins = df_numeric.min().values
df_maxs = df_numeric.max().values
df_scaled = (df_numeric - df_mins) / (df_maxs - df_mins)
km = KMeans(init='random', n_clusters=3).fit(df_scaled)
# Undo the min-max scaling so the centers are in the units of df_numeric
df_mins = df_mins.reshape(1, -1)
df_maxs = df_maxs.reshape(1, -1)
centers = km.cluster_centers_ * (df_maxs - df_mins) + df_mins
df_numeric["cluster"] = km.labels_
g = sns.PairGrid(data=df_numeric, hue="cluster", corner=True, palette="viridis")
g.map_lower(sns.scatterplot, marker=".")
g.map_lower(show_centers, centers=centers)
g.map_diag(sns.histplot)
g.map_diag(show_centers, y=None, centers=centers)
g.add_legend(frameon=True, bbox_to_anchor=(.61, .6), loc='center', title='Cluster')
plt.tight_layout()
plt.show()
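The example above undoes a hand-written min-max scaling. If df_scaled was instead produced by a scikit-learn scaler, the same step can probably be done with the scaler's inverse_transform method. The following is only a sketch under that assumption: the choice of StandardScaler, the use of load_iris (to avoid a download), and the names scaler and centers_df are illustrative and not taken from the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

# Iris ships with scikit-learn, so no download is needed.
df_numeric = load_iris(as_frame=True).data

# Assumption: the scaling was done with a scikit-learn scaler object.
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df_numeric)

km = KMeans(init="random", n_clusters=3, n_init=10, random_state=0).fit(df_scaled)

# Map the cluster centers from scaled space back to the original units.
centers = scaler.inverse_transform(km.cluster_centers_)

# Optional: wrap in a DataFrame so each column keeps its original name.
centers_df = pd.DataFrame(centers, columns=df_numeric.columns)
print(centers_df.round(2))
```

The resulting centers array has one row per cluster and one column per feature, so it can be passed as centers=centers to show_centers exactly as in the example above.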
Upvotes: 2