Scatter plots in Python to show the points closest to each centroid in K-means clustering

Question

I am writing a simple K-means clustering algorithm, and I am trying to render a scatter plot of the sample data (rows loaded from a CSV file into a NumPy matrix X).

Say X is a NumPy matrix in which each row is one example with 10 features. In my case the features are attributes of a network flow, such as source IP address, destination IP address, source port, and destination port. I have also computed the centroids for K-means (where K is the number of centroids).

I have a list idx holding, for each row of X, the index of the centroid that row belongs to. For example, if row 5 of X belongs to centroid 3, then idx[4] = 3 (since indexing starts from 0). This way each row of X, a single data record of 10 features, belongs to exactly one centroid.

I want to draw a scatter plot of the data points in X, with a separate color for each centroid. For example, if rows 5 and 8 of X are closest to centroid 3, I want them to share a color that differs from the colors of the other clusters. In Octave, I could have written it like this:

function plotPoints(X,idx,K)
  p= hsv(K+1) % palette
  c= p(idx,:) % color
  scatter(X(:,1),X(:,2),15,c) % plot the scatter plot

However, in Python I am not sure how to implement the same thing so that data samples with the same index assignment get the same color. My code currently shows all the rows of X in red and all the centroids in blue, as shown below:

import matplotlib.pyplot as plt

def plotPoints(X,idx,K,centroids):
    srcport=X[:,5]
    dstport=X[:,6]

    fig = plt.figure()
    ax=fig.add_subplot(111,projection='3d')
    ax.scatter(srcport,dstport,c='r',marker='x')
    ax.scatter(centroids[:,5],centroids[:,6],c='b',marker='o', s=160)
    ax.set_xlabel('Source port')
    ax.set_ylabel('Destination port')
    plt.show()

Please note: I am plotting only 2 features, on the x and y axes, not all 10 features. I should have mentioned that earlier.

andrew_reece · Accepted Answer

Seaborn and Pandas work well together for this kind of plotting.
If they're available to you, consider the following solution:

# generate sample data
import numpy as np
values = np.random.random(500).reshape(50,10) * 10
centroid = np.random.choice(np.arange(5), size=50).reshape(-1,1)
data = np.concatenate((values, centroid), axis=1)

# convert to DataFrame
import pandas as pd
colnames = ['a','b','c','d','e','f','g','h','i','j','centroid']
df = pd.DataFrame(data, columns=colnames)

# data frame looks like:
df.head()

   a  b  c  d  e  f  g  h  i  j  centroid
0  6  9  9  9  1  2  4  0  8  9         4
1  9  1  0  0  7  9  9  3  7  2         1
2 10  4  8  7  2  8  9  4  6  8         3
3  2  6  5  2  8  4  9  3  9  5         4
4  9  7  5  1  3  2  1  8  3  4         4

# plot with Seaborn
import seaborn as sns
sns.lmplot(x='a', y='b', hue='centroid', data=df, scatter=True, fit_reg=False)

Here's a pure NumPy/pyplot version, if you're restricted to those modules:

from matplotlib import pyplot as plt
fig, ax = plt.subplots()

colors = {0:'purple', 1:'red', 2:'blue', 3:'green', 4:'black'}

ax.scatter(x=data[:,0], y=data[:,1], c=[colors[x] for x in data[:,10]])
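
If K isn't fixed at 5, the hard-coded color dictionary above can be replaced with a palette sampled from a colormap, which is close in spirit to the Octave `hsv(K+1)` / `p(idx,:)` idiom from the question. The following is only a sketch under some assumptions: `plot_points`, `xcol` and `ycol` are names I made up for illustration, `idx` is assumed to hold 0-based integer cluster indices, `centroids` is assumed to be a (K, n_features) array, and the sample data is synthetic.

```python
import numpy as np
from matplotlib import pyplot as plt

def plot_points(X, idx, K, centroids, xcol=0, ycol=1):
    """Scatter two columns of X, colored by cluster index (hypothetical helper)."""
    idx = np.asarray(idx, dtype=int)  # assumed to be 0-based cluster indices
    # One RGBA row per cluster, sampled from the 'hsv' colormap.
    # K + 1 samples are taken and the last one dropped, because hsv
    # wraps around and its first and last colors are both red.
    palette = plt.cm.hsv(np.linspace(0, 1, K + 1))[:K]

    fig, ax = plt.subplots()
    # palette[idx] picks one color row per data point (like p(idx,:) in Octave)
    ax.scatter(X[:, xcol], X[:, ycol], c=palette[idx], marker='x')
    # draw each centroid in the color of its own cluster, with a black outline
    ax.scatter(centroids[:, xcol], centroids[:, ycol], c=palette,
               marker='o', s=160, edgecolors='k')
    return fig, ax, palette

# synthetic sample data: 50 rows, 10 features, 5 clusters
K = 5
rng = np.random.default_rng(0)
X = rng.random((50, 10)) * 10
idx = np.arange(50) % K  # made-up assignments, every cluster non-empty
centroids = np.array([X[idx == k].mean(axis=0) for k in range(K)])

fig, ax, palette = plot_points(X, idx, K, centroids)
ax.set_xlabel('feature 0')
ax.set_ylabel('feature 1')
```

Call `plt.show()` afterwards to display the figure. For the data in the question you would pass `xcol=5, ycol=6` to plot source port against destination port. If you don't need to reuse the colors for the centroids, passing `c=idx` together with a `cmap` argument to `ax.scatter` lets Matplotlib do the index-to-color mapping for you.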
