Mario

Reputation: 1976

Problem with creating bar plots for KMeans-based clustering algorithm

I'm struggling to plot bar charts for a KMeans-based clustering algorithm. I want to show the clusters so that the outlier cluster sits far out on the x-axis while the remaining clusters stay close to each other. I think the problem is the xticks, which are evenly spaced on the x-axis:

---|---|---|-----------------> x-axis
0  1   2   3 

Here I want to show that, for example, the cluster labelled 3 lies some distance from the others according to its score, which may require adjusting the tick positions or bin widths, like this:

---|---|--------------|------> x-axis
0  1   2              3 

So far, this is what I have to show the outlier-detection results of the KM-based algorithm: img

from sklearn.cluster import KMeans
import seaborn as sns
import numpy as np
from pandas import DataFrame
from math import pow
import math

class ODKM:
    
    def __init__(self,n_clusters=15,effectiveness=500,max_iter=2):
        self.n_clusters=n_clusters
        self.effectiveness=effectiveness
        self.max_iter=max_iter
        self.kmeans = {}
        self.cluster_score = {}
        #self.labels = {}
        
    def fit(self, data):
        length = len(data)
        for column in data.columns:
            kmeans = KMeans(n_clusters=self.n_clusters,max_iter=self.max_iter)
            self.kmeans[column]=kmeans
            kmeans.fit(data[column].values.reshape(-1,1))
            assign = DataFrame(kmeans.predict(data[column].values.reshape(-1,1)),columns=['cluster'])
            cluster_score=assign.groupby('cluster').apply(len).apply(lambda x:x/length)
            ratio=cluster_score.copy()
        
            sorted_centers = sorted(kmeans.cluster_centers_)
            max_distance = ( sorted_centers[-1] - sorted_centers[0] )[ 0 ]
        
            for i in range(self.n_clusters):
                for k in range(self.n_clusters):
                    if i != k:
                        # index the single feature so dist is a plain float rather than a 1-element array
                        dist = abs(kmeans.cluster_centers_[i, 0] - kmeans.cluster_centers_[k, 0])/max_distance
                        effect = ratio[k]*(1/pow(self.effectiveness,dist))
                        cluster_score[i] = cluster_score[i]+effect
                        
            self.cluster_score[column] = cluster_score
                    
    def predict(self, data):
        length = len(data)
        score_array = np.zeros(length)
        for column in data.columns:
            kmeans = self.kmeans[ column ]
            cluster_score = self.cluster_score[ column ]
            #labels = kmeans.labels_ 
            assign = kmeans.predict( data[ column ].values.reshape(-1,1) )
            #print(assign)
            
            for i in range(length):
                score_array[i] = score_array[i] + math.log10( cluster_score[assign[i]] )
            
        return score_array #,labels
    
    def fit_predict(self,data):
        self.fit(data)
        return self.predict(data)

Testing the results:

import pandas as pd

df = pd.DataFrame(data={'attr1':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15]})

#generate score from KM-based algorithm via class ODKM
odkm_model = ODKM(n_clusters=3, max_iter=1)
result = odkm_model.fit_predict(df)

#include generated scores to the main frame to reach desired plot
df['ODKM_Score']= result 
df

#for i in result:
#    print(round(i,2))

#results
#-0.51, -0.51 , -0.51 , -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51, -0.51
#-0.78, -0.78, -0.78, -0.78, -0.78, -0.78, -0.78
#-0.99, -0.99, -0.99, -0.99
#-1.99

You can find my entire code, including this KM-based algorithm, in a Colab notebook for quick debugging. Feel free to implement your solution in the notebook or comment on the cells. If changes to the ODKM algorithm itself (where the KMeans clustering runs) are needed, it is available as the class ODKM. It may be better to extract the predicted cluster labels and add them as a new column, Cluster_label, next to the ODKM score so they are easier to use in the bar plots.
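
A minimal sketch of that last idea, assuming the labels are obtained by clustering the one-dimensional scores themselves (ODKM as written fits one KMeans per column and does not return a single label per row). The scores are the rounded values listed above; the `scored` frame and the `n_init` and `random_state` settings are illustrative choices, not part of the original code:

```python
import pandas as pd
from sklearn.cluster import KMeans

# Rounded ODKM scores from the results listed above (25 rows)
scores = [-0.51] * 13 + [-0.78] * 7 + [-0.99] * 4 + [-1.99]
scored = pd.DataFrame({"ODKM_Score": scores})

# Assumption: one label per row comes from a separate KMeans on the score column
km = KMeans(n_clusters=3, n_init=10, random_state=0)
scored["Cluster_label"] = km.fit_predict(scored[["ODKM_Score"]])

# Size and mean score of each cluster
summary = scored.groupby("Cluster_label")["ODKM_Score"].agg(["count", "mean"])
print(summary)
```

The resulting `Cluster_label` column can then be passed as `hue` or used to group the bars. The label numbers themselves are arbitrary, so sort by the cluster mean if a fixed order is needed.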

The expected output should look like this (ideally, bins within the same cluster share the same color, e.g. the first cluster C1):

img

Update: Apart from the bar plot, I can plot a histogram and a distribution curve, but I can't figure out how to color the bins and pass the cluster labels so that the histogram reflects the clustering results as expected.

import matplotlib.pyplot as plt

## Left output
# plot only the 'Score' column (not all columns in the first phase) to simplify the problem
#cols_ = df.columns[-1:] 
ax1 = plt.subplot2grid((1,1), (0,0))
df['Score'].plot(kind='hist', ax=ax1 , color='b', alpha=0.4)
df['Score'].plot(kind='kde', ax=ax1, secondary_y=True, label='distribution', color='b', lw=2)

## Right output
# note: sns.distplot is deprecated in recent seaborn releases
sns.distplot(df['Score'] , color='b')

Besides the question of reflecting the clustering results on the graph, I noticed some differences between these two plots, highlighted in the picture below: for example, the scale of the y-axis and the gap between the main bins near the origin of the x-axis:

img

I also found this post, but I couldn't adapt it to the class ODKM to solve my problem dynamically. More recently, I managed to get this:

df['Score'] = df['Score'].abs()
sns.displot(df, 
            x='Score',
            hue='Cluster_labels',
            palette=["#00f0f0","#ff0000","#00ff00"],
            alpha=1)

img

Upvotes: 1

Views: 709

Answers (2)

lazy

Reputation: 784

import pandas as pd


df = pd.DataFrame(data={'attr1':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
                        'attr2':[1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15]
                        })

import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN

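def dbscan_scatterplot(df):
    # Note: this uses DBSCAN rather than KMeans, so the number of clusters is not fixed in advance

    column_i = 'attr1'
    column_j = 'attr2'

    df_temp = df[[column_i, column_j]]
    
    # model: points within eps of each other end up in the same cluster
    y_pred = DBSCAN(eps = 3, min_samples = 1).fit_predict(df_temp)
    
    # plot: color each point by its cluster label
    plt.scatter(df_temp[column_i], df_temp[column_j], c=y_pred, cmap='rainbow', alpha=0.7, edgecolors='b')

    plt.show()

dbscan_scatterplot(df)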

This clustering method (DBSCAN) only needs a distance threshold (eps) to be specified; the points can then be colored by their cluster label.

This page will help you quickly understand how the algorithm works: https://www.naftaliharris.com/blog/visualizing-dbscan-clustering/
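
As a rough sketch, assuming the goal is a label column rather than only a scatter plot, the DBSCAN labels can be stored next to the data and counted per cluster. The `points` frame and the `Cluster_label` column name are illustrative; the data and the `eps=3, min_samples=1` settings are the ones used above:

```python
import pandas as pd
from sklearn.cluster import DBSCAN

points = pd.DataFrame({
    "attr1": [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,7,7,7,7,15],
    "attr2": [1,1,1,1,2,2,2,2,2,2,2,2,3,5,5,6,6,7,7,7,13,13,13,14,15],
})

# min_samples=1 makes every point a core point, so no row is marked as noise (-1)
labels = DBSCAN(eps=3, min_samples=1).fit_predict(points[["attr1", "attr2"]])
points["Cluster_label"] = labels

# Number of rows in each cluster
sizes = points.groupby("Cluster_label").size()
print(sizes)
```

With these settings the data splits into three groups of 20, 4 and 1 rows, the last being the isolated point (15, 15).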

Upvotes: 0

warped

Reputation: 9481

For the 1D case, you can use the centers of the clusters as the x-positions for your bars.

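import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

n_clusters = 3

km = KMeans(init='k-means++', n_clusters=n_clusters).fit(df[['Score']])

# number of points assigned to each cluster
counts = np.bincount(km.labels_)

# draw each bar at its cluster center, so a distant cluster sits apart on the x-axis
for center, count, label in zip(km.cluster_centers_.ravel(), counts, range(n_clusters)):
    print(center, count)
    plt.bar(center, count, width=0.2, label=label)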
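
A self-contained variant of the same idea, offered as a sketch rather than a definitive solution. It assumes the absolute values of the rounded scores from the question as a stand-in for `df['Score']`, names the clusters C1 to C3 from left to right as in the question's expected output, and puts the x-ticks at the cluster centers. The bar width and the `n_init` and `random_state` settings are arbitrary choices:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import KMeans

# Assumption: absolute values of the rounded scores listed in the question
scores = np.array([0.51] * 13 + [0.78] * 7 + [0.99] * 4 + [1.99])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores.reshape(-1, 1))
centers = km.cluster_centers_.ravel()
counts = np.bincount(km.labels_, minlength=3)

# KMeans label numbers are arbitrary, so order the clusters by center position
order = np.argsort(centers)

fig, ax = plt.subplots()
for rank, idx in enumerate(order, start=1):
    # one bar per cluster, placed at the cluster center and colored separately
    ax.bar(centers[idx], counts[idx], width=0.1, label=f"C{rank}")

# ticks only at the cluster centers, so the spacing reflects the score distance
ax.set_xticks(centers[order])
ax.set_xticklabels([f"{c:.2f}" for c in centers[order]])
ax.set_xlabel("Score (cluster center)")
ax.set_ylabel("Number of points")
ax.legend()
```

In an interactive session, call `plt.show()` afterwards. Because the bars sit at the actual centers, the outlier cluster ends up far to the right while the other two stay close together.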

Upvotes: 0
