Reputation: 1976
I'm struggling to plot bar charts for a KMeans-based clustering algorithm. The problem is that I want to display the clusters in such a way that the extreme outlier cluster appears at the far end of the x-axis while the rest of the clusters stay relatively close to each other. I think the problem is the xticks, which are equally distributed on the x-axis:
---|---|---|---|-------------> x-axis
   0   1   2   3
In this context, I want to show that, e.g., the cluster predicted with label 3 (based on Score) is located a bit farther away, which needs some adjustment of the bin widths, maybe like this:
---|---|---|--------------|--> x-axis
   0   1   2              3
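For what it's worth, matplotlib does not force equally spaced bars or ticks: if the bars are drawn at explicit x-positions, the spacing follows the data. A minimal sketch with made-up positions (the counts 13/7/4/1 are the cluster sizes from the data below), just to illustrate:

import matplotlib.pyplot as plt

# hypothetical positions: clusters 0-2 sit close together, cluster 3 far out
positions = [0, 1, 2, 6]
counts = [13, 7, 4, 1]
plt.bar(positions, counts, width=0.2)
plt.xticks(positions, labels=['0', '1', '2', '3'])
plt.show()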
So far I have reached the following results demonstrating the KM-based algorithm for outlier detection:
from sklearn.cluster import KMeans
import seaborn as sns
import numpy as np
from pandas import DataFrame
import math

class ODKM:
    """Outlier detection via per-column KMeans: small, isolated clusters get low scores."""

    def __init__(self, n_clusters=15, effectiveness=500, max_iter=2):
        self.n_clusters = n_clusters
        self.effectiveness = effectiveness
        self.max_iter = max_iter
        self.kmeans = {}         # one fitted KMeans per column
        self.cluster_score = {}  # per-column score for each cluster

    def fit(self, data):
        length = len(data)
        for column in data.columns:
            kmeans = KMeans(n_clusters=self.n_clusters, max_iter=self.max_iter)
            self.kmeans[column] = kmeans
            kmeans.fit(data[column].values.reshape(-1, 1))
            assign = DataFrame(kmeans.predict(data[column].values.reshape(-1, 1)),
                               columns=['cluster'])
            # base score: the fraction of points that fall into each cluster
            cluster_score = assign.groupby('cluster').apply(len).apply(lambda x: x / length)
            ratio = cluster_score.copy()
            sorted_centers = sorted(kmeans.cluster_centers_)
            max_distance = (sorted_centers[-1] - sorted_centers[0])[0]
            # boost each cluster's score by its neighbours, decaying with distance
            for i in range(self.n_clusters):
                for k in range(self.n_clusters):
                    if i != k:
                        dist = abs(kmeans.cluster_centers_[i][0]
                                   - kmeans.cluster_centers_[k][0]) / max_distance
                        effect = ratio[k] * (1 / math.pow(self.effectiveness, dist))
                        cluster_score[i] = cluster_score[i] + effect
            self.cluster_score[column] = cluster_score

    def predict(self, data):
        length = len(data)
        score_array = np.zeros(length)
        for column in data.columns:
            kmeans = self.kmeans[column]
            cluster_score = self.cluster_score[column]
            assign = kmeans.predict(data[column].values.reshape(-1, 1))
            for i in range(length):
                # sum of log-scores across columns; more negative = more anomalous
                score_array[i] = score_array[i] + math.log10(cluster_score[assign[i]])
        return score_array

    def fit_predict(self, data):
        self.fit(data)
        return self.predict(data)
Test the results:

import pandas as pd

df = pd.DataFrame(data={'attr1': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15]})

# generate scores from the KM-based algorithm via class ODKM
odkm_model = ODKM(n_clusters=3, max_iter=1)
result = odkm_model.fit_predict(df)

# add the generated scores to the main frame to reach the desired plot
# (stored as 'Score', which is the column name the plotting code below uses)
df['Score'] = result
df
# for i in result:
#     print(round(i, 2))
# results:
# -0.51 (13 times), -0.78 (7 times), -0.99 (4 times), -1.99 (once: the outlier 15)
You can find my entire code, including this KM-based algorithm, in a Colab notebook for quick debugging. Feel free to implement your solution in the notebook or comment on its cells; the ODKM algorithm itself (where the KMeans clustering is executed) is accessible there as class ODKM. Maybe it is better to extract the predicted cluster labels and add a new column named Cluster_labels next to the ODKM Score for easier access when building the bar plots.
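For example, a minimal sketch of how such a column could be extracted, assuming the per-column KMeans models stored in odkm_model.kmeans are reused (attr1 is just the example column here):

# hedged sketch: reuse the fitted per-column KMeans model to label each row
labels = odkm_model.kmeans['attr1'].predict(df['attr1'].values.reshape(-1, 1))
df['Cluster_labels'] = labels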
The expected output should look like this (ideally, bins within the same cluster share the same colour, e.g. the 1st cluster C1):
Update: apart from the bar plot solution, I can plot a histogram and a distribution, but I can't figure out how to colour the bins and pass the cluster labels so that the histogram reflects the clustering results as expected.
import matplotlib.pyplot as plt

## left output
# just plot the 'Score' column (not all columns in the 1st phase) to simplify the problem
# cols_ = df.columns[-1:]
ax1 = plt.subplot2grid((1, 1), (0, 0))
df['Score'].plot(kind='hist', ax=ax1, color='b', alpha=0.4)
df['Score'].plot(kind='kde', ax=ax1, secondary_y=True, label='distribution', color='b', lw=2)

## right output
sns.distplot(df['Score'], color='b')  # distplot is deprecated in recent seaborn; histplot/displot replace it
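As a side note, one way to colour histogram bins per cluster with plain matplotlib is to split the scores by label and pass the groups as a list of arrays; a minimal sketch, assuming df already carries the Cluster_labels column mentioned above:

# hedged sketch: one colour per cluster label, stacked into a single histogram
groups = [g['Score'].values for _, g in df.groupby('Cluster_labels')]
plt.hist(groups, bins=10, stacked=True,
         color=["#00f0f0", "#ff0000", "#00ff00"][:len(groups)],
         label=[f'C{lab}' for lab in sorted(df['Cluster_labels'].unique())])
plt.legend()
plt.show()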
Despite reflecting the clustering results on the graph, I noticed some differences between these two plots, as highlighted in the picture below, e.g. the scale of the y-axis and the gap between the main bins close to the origin of the x-axis:
I also found this post, but I couldn't adapt it to class ODKM to resolve my problem dynamically.
I could also achieve this recently:

df['Score'] = df['Score'].abs()
sns.displot(df,
            x='Score',
            hue='Cluster_labels',
            palette=["#00f0f0", "#ff0000", "#00ff00"],
            alpha=1)
Upvotes: 1
Views: 709
Reputation: 784
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

df = pd.DataFrame(data={'attr1': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2': [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15]})

def dbscan_scatterplot(df):
    column_i = 'attr1'
    column_j = 'attr2'
    df_temp = df[[column_i, column_j]]
    # model: points within eps=3 of a neighbour join the same cluster
    y_pred = DBSCAN(eps=3, min_samples=1).fit_predict(df_temp)
    # plot, coloured by cluster label
    plt.scatter(df_temp[column_i], df_temp[column_j], c=y_pred, cmap='rainbow',
                alpha=0.7, edgecolors='b')
    plt.show()

dbscan_scatterplot(df)
DBSCAN only needs the distance threshold (eps) to be specified, and then the points can be coloured according to their cluster label.
This page will help you quickly understand the principle of the algorithm: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
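If the question's Cluster_labels column is wanted, the same labels can be written back into the frame (a small sketch under the same eps/min_samples assumptions):

# hedged sketch: store the DBSCAN labels for later hue-based plots
df['Cluster_labels'] = DBSCAN(eps=3, min_samples=1).fit_predict(df[['attr1', 'attr2']])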
Upvotes: 0
Reputation: 9481
For the 1D case, you can use the centers of the clusters as the x-positions for your bars.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

n_clusters = 3
km = KMeans(init='k-means++', n_clusters=n_clusters).fit(df[['Score']])
counts = np.bincount(km.labels_)  # cluster sizes
for center, count, label in zip(km.cluster_centers_, counts, range(n_clusters)):
    print(center, count)
    plt.bar(center, count, width=0.2, label=label)
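To make the bar positions readable, the same centres can also serve as tick positions (a small follow-up sketch using the km model above):

plt.xticks(km.cluster_centers_.ravel().round(2))  # ticks exactly at the cluster centres
plt.legend(title='cluster')
plt.show()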
Upvotes: 0