user6882757

How to scatter plot for Kmeans and print the outliers

I'm working with the Scikit-Learn KMeans model.

This is the code I have implemented, where I have created 3 clusters (0, 1, 2):

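import pandas as pd
from scipy.spatial.distance import euclidean  # assumed source of euclidean()
from sklearn.cluster import KMeans

df = pd.read_csv(r'1.csv', index_col=None)
dummies = pd.get_dummies(data=df)  # one-hot encode the categorical columns
km = KMeans(n_clusters=3).fit(dummies)
dummies['cluster_id'] = km.labels_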
def distance_to_centroid(row, centroid):
    row = row[['id', 'product', 'store', 'revenue','store_capacity', 'state_AL', 'state_CA', 'state_CH',
       'state_WD', 'country_India', 'country_Japan', 'country_USA']]
    return euclidean(row, centroid)
dummies['distance_to_center0'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[0]),1)

dummies['distance_to_center1'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[1]),1)

dummies['distance_to_center2'] = dummies.apply(lambda r: distance_to_centroid(r,
    km.cluster_centers_[2]),1)

dummies.head()

This is a sample of the data set that I am using:

    id,product,store,revenue,store_capacity,state
    1,Ball,AB,222,1000,CA
    1,Pen,AB,234,1452,WD
    2,Books,CD,543,888,MA
    2,Ink,EF,123,9865,NY

Upvotes: 3

Views: 1888

Answers (1)

Roim

Reputation: 3066

To create a scatter plot of the clusters, you just need to color each point by its cluster. Take the following code as an example:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns

df = pd.DataFrame(np.random.rand(10,2), columns=["A", "B"])
km = KMeans(n_clusters=3).fit(df)
df['cluster_id'] = km.labels_
dic = {0:"Blue", 1:"Red", 2:"Green"}
sns.scatterplot(x="A", y="B", data=df, hue="cluster_id", palette = dic)

Output (the data is random, so your plot will differ):

[Figure: scatter plot of column A against column B, with points colored by cluster_id]

hue splits the points by their 'cluster_id' value, which in our case means by cluster. palette only controls the colors (it uses dic, defined one line earlier).

Your data has more than two features. As you know, we cannot draw a 6-dimensional scatter plot. You can do one of the following:

  1. Select only 2 features and plot them (feature selection)
  2. Reduce the dimensions with PCA, t-SNE, or another algorithm and use the new features for the scatter plot (feature extraction)
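As a minimal sketch of option 2, assuming PCA is an acceptable reduction for your data: the random 6-column frame and the names features, coords, and plot_df below are made up for illustration and stand in for your one-hot encoded data. The clustering still runs on all features, and PCA is used only to get two plotting axes.

```python
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in for the encoded data: 50 rows, 6 numeric features.
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((50, 6)), columns=[f"f{i}" for i in range(6)])

# Cluster in the full 6-dimensional space.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Project onto the first two principal components, for plotting only.
pca = PCA(n_components=2)
coords = pca.fit_transform(features)
plot_df = pd.DataFrame(coords, columns=["PC1", "PC2"])
plot_df["cluster_id"] = km.labels_

# One scatter call per cluster so each gets its own color and legend entry.
fig, ax = plt.subplots()
for cluster_id, group in plot_df.groupby("cluster_id"):
    ax.scatter(group["PC1"], group["PC2"], label=f"cluster {cluster_id}")
ax.set_xlabel("PC1")
ax.set_ylabel("PC2")
ax.legend()
```

Call plt.show() at the end if you run this as a script. Keep in mind that clusters which are well separated in the original space can overlap in the 2D projection.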

As for your second question, it depends on how you define "outliers". There is no single definition, and it depends on the case. After running KMeans every point is assigned to a cluster. KMeans does not give you "well, I'm not sure about that point. It's probably an outlier". Once you decide on a definition for outlier (e.g. "distance from center > 3") you just check if a point is an outlier, and print it.
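Here is a rough sketch of that check, under one assumed definition: a point is an outlier if its distance to its own cluster center is above the 95th percentile of all such distances. The random data, the column name distance_to_center, and the 95% cutoff are illustrative choices, not something KMeans provides. Replace the threshold with whatever rule fits your case.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical data: 50 random points with two features.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((50, 2)), columns=["A", "B"])
feature_cols = ["A", "B"]

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(df[feature_cols])
df["cluster_id"] = km.labels_

# Look up the center of each point's own cluster, then take the
# Euclidean distance between the point and that center.
own_centers = km.cluster_centers_[km.labels_]
df["distance_to_center"] = np.linalg.norm(
    df[feature_cols].to_numpy() - own_centers, axis=1
)

# Assumed outlier rule: farther from the center than 95% of all points.
threshold = df["distance_to_center"].quantile(0.95)
outliers = df[df["distance_to_center"] > threshold]
print(outliers)
```

Note that a single point very far from everything else may end up as its own cluster with distance 0, so a distance-based rule like this one will not flag it.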

If I misunderstood either of your questions, please clarify. The more precise you are about what you're trying to do, the easier it is for the community to help you.

Upvotes: 2
