I'm working with the Scikit-Learn KMeans model.
This is the code I have implemented, where I have created 3 clusters (0, 1, 2):
import pandas as pd
from scipy.spatial.distance import euclidean
from sklearn.cluster import KMeans

df = pd.read_csv(r'1.csv', index_col=None)
dummies = pd.get_dummies(data=df)
km = KMeans(n_clusters=3).fit(dummies)
dummies['cluster_id'] = km.labels_

def distance_to_centroid(row, centroid):
    # Keep only the columns used for clustering (this drops cluster_id)
    row = row[['id', 'product', 'store', 'revenue', 'store_capacity',
               'state_AL', 'state_CA', 'state_CH', 'state_WD',
               'country_India', 'country_Japan', 'country_USA']]
    return euclidean(row, centroid)

# Distance from every row to each of the three cluster centers
dummies['distance_to_center0'] = dummies.apply(
    lambda r: distance_to_centroid(r, km.cluster_centers_[0]), axis=1)
dummies['distance_to_center1'] = dummies.apply(
    lambda r: distance_to_centroid(r, km.cluster_centers_[1]), axis=1)
dummies['distance_to_center2'] = dummies.apply(
    lambda r: distance_to_centroid(r, km.cluster_centers_[2]), axis=1)
dummies.head()
This is a sample of the data set that I am using:
id,product,store,revenue,store_capacity,state
1,Ball,AB,222,1000,CA
1,Pen,AB,234,1452,WD
2,Books,CD,543,888,MA
2,Ink,EF,123,9865,NY
To create a scatter plot of the clusters, you just need to color each point by its cluster. Take the following code as an example:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans
import seaborn as sns
df = pd.DataFrame(np.random.rand(10,2), columns=["A", "B"])
km = KMeans(n_clusters=3).fit(df)
df['cluster_id'] = km.labels_
dic = {0:"Blue", 1:"Red", 2:"Green"}
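sns.scatterplot(x="A", y="B", data=df, hue="cluster_id", palette=dic)
plt.show()
The output is a scatter plot with the points colored by cluster (remember that the data is random, so your plot will differ from run to run).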
hue splits the points by their 'cluster_id' value, which in our case means the different clusters. palette just controls the colors (it uses dic, defined one line earlier).
Your data consists of more than two columns. As you know, we cannot draw a 6-dimensional scatter plot. You can do one of the following:
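One common way to handle this, offered here as an assumption and not necessarily what fits your data best, is to cluster on all the columns and then project the points onto two principal components purely for plotting. The sketch below uses synthetic data with hypothetical column names A to F, and the helper plot_projection is made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for a numeric table with 6 feature columns (hypothetical data).
rng = np.random.default_rng(0)
features = pd.DataFrame(rng.random((30, 6)), columns=list("ABCDEF"))

# Cluster in the full 6-dimensional space.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(features)

# Reduce to 2 dimensions for plotting only; the clustering itself is unchanged.
coords = PCA(n_components=2).fit_transform(features)
projected = pd.DataFrame(coords, columns=["pc1", "pc2"])
projected["cluster_id"] = km.labels_


def plot_projection(frame):
    """Scatter plot of the 2D projection, colored by cluster (illustrative helper)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    palette = {0: "blue", 1: "red", 2: "green"}
    sns.scatterplot(x="pc1", y="pc2", data=frame, hue="cluster_id", palette=palette)
    plt.show()
```

Calling plot_projection(projected) draws the plot. Keep in mind that a 2D projection loses information, so clusters that are well separated in the original space may overlap on screen.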
As for your second question, it depends on how you define "outliers". There is no single definition, and it depends on the case. After running KMeans every point is assigned to a cluster. KMeans does not give you "well, I'm not sure about that point. It's probably an outlier". Once you decide on a definition for outlier (e.g. "distance from center > 3") you just check if a point is an outlier, and print it.
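As a minimal sketch of that last step, assuming you define an outlier as a point whose distance to its nearest cluster center exceeds a cutoff: KMeans.transform returns the distance from every point to every center, so the check is a single comparison. The data and the threshold value of 3 are hypothetical and only there for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D data (hypothetical); replace with your own numeric features.
rng = np.random.default_rng(0)
X = rng.normal(scale=2.0, size=(200, 2))

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# transform() gives distances to all centers, shape (n_samples, n_clusters).
distances = km.transform(X)

# Distance from each point to its nearest center.
dist_to_center = distances.min(axis=1)

# Hypothetical definition of an outlier: farther than `threshold` from the center.
threshold = 3.0
outlier_idx = np.flatnonzero(dist_to_center > threshold)

print("Outlier rows:", outlier_idx)
print(X[outlier_idx])
```

Note that a point lying very far from everything else can end up as the only member of its own cluster, in which case its distance to the center is zero and this rule will not flag it. That is one more reason the definition of "outlier" has to come from you and not from KMeans.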
If I misunderstood any of your questions, please clarify. The more precise you are about what you're trying to do, the easier it is for the community to help you.