Reputation: 653
I am writing a simple K-means clustering algorithm and I am trying to render a scatter plot of the sample data (rows loaded from a CSV file into a numpy matrix X).
Say X is a numpy matrix in which each row is one example with 10 features. In my case they are attributes of a network flow, such as source IP address, destination IP address, source port and destination port. I have also computed the centroids for K-means (where K is the total number of centroids). I have a list idx that holds, for each row of X, the index of the centroid that row belongs to. For example, if row 5 of X belongs to centroid 3, then idx[4]=3 (since indexing starts from 0). This way each row of X, an individual data record of 10 features, belongs to exactly one centroid. I want to draw a scatter plot of the data points in X, with a separate color for each centroid. For example, if rows 5 and 8 of X are closest to centroid 3, I want both drawn in the same color, different from the colors of the other centroids. If I were doing this in Octave, I could write the code like this:
function plotPoints(X,idx,K)
p= hsv(K+1) % palette
c= p(idx,:) % color
scatter(X(:,1),X(:,2),15,c) % plot the scatter plot
However, in Python I am not sure how to implement the same thing so that data samples with the same index assignment get the same color. My code currently shows all the X rows in red and all the centroids in blue, as shown below:
import matplotlib.pyplot as plt

def plotPoints(X, idx, K, centroids):
    srcport = X[:,5]
    dstport = X[:,6]
    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(srcport, dstport, c='r', marker='x')
    ax.scatter(centroids[:,5], centroids[:,6], c='b', marker='o', s=160)
    ax.set_xlabel('Source port')
    ax.set_ylabel('Destination port')
    plt.show()
Please note: I am only plotting 2 features on x & y axis and not all of the 10 features. I should have mentioned that earlier.
Upvotes: 0
Views: 4408
Reputation: 2936
Check out the answers to the post "Scatter plot and Color mapping in Python". I assume your centroid indices correspond to clusters. In that case you can either pass the index array directly as the colors:
ax.scatter(srcport, dstport, c=idx, marker='x')
ax.scatter(centroids[:,5], centroids[:,6], c=np.arange(K), marker='o', s=160)
or use a colormap explicitly (np.asarray is needed because idx is a plain list):
ax.scatter(srcport, dstport, c=plt.cm.viridis(np.asarray(idx) / K), marker='x')
ax.scatter(centroids[:,5], centroids[:,6], c=plt.cm.viridis(np.arange(K) / K),
           marker='o', s=160)
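For comparison with the Octave snippet in the question (`p = hsv(K+1)`, `c = p(idx,:)`), here is a minimal sketch of the same palette-indexing idea in matplotlib. The helper name `palette_colors` and the random stand-in data are assumptions made for illustration, not part of the original code, and idx is taken to be 0-based as described in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

def palette_colors(idx, K):
    """Return one RGBA row per sample, picked by its 0-based cluster index."""
    # K evenly spaced hues from the cyclic hsv colormap; sampling at k / K
    # avoids hitting both ends of the map, which are the same red.
    palette = plt.cm.hsv(np.arange(K) / K)      # shape (K, 4)
    return palette[np.asarray(idx, dtype=int)]  # shape (len(idx), 4)

# Hypothetical stand-in data: 50 rows of 10 features, 3 clusters.
rng = np.random.default_rng(0)
X = rng.random((50, 10)) * 10
K = 3
idx = rng.integers(0, K, size=50)

colors = palette_colors(idx, K)
fig, ax = plt.subplots()
ax.scatter(X[:, 5], X[:, 6], c=colors, marker='x')
```

Calling `plt.show()` afterwards displays the figure. Rows that share an entry in idx receive identical RGBA values, so they are drawn in the same color.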
Upvotes: 2
Reputation: 21264
Seaborn and Pandas work well together for this kind of plotting.
If they're available to you, consider the following solution:
# generate sample data
import numpy as np
values = np.random.random(500).reshape(50,10) * 10
centroid = np.random.choice(np.arange(5), size=50).reshape(-1,1)
data = np.concatenate((values, centroid), axis=1)
# convert to DataFrame
import pandas as pd
colnames = ['a','b','c','d','e','f','g','h','i','j','centroid']
df = pd.DataFrame(data, columns=colnames)
# data frame looks like:
df.head()
    a  b  c  d  e  f  g  h  i  j  centroid
0   6  9  9  9  1  2  4  0  8  9         4
1   9  1  0  0  7  9  9  3  7  2         1
2  10  4  8  7  2  8  9  4  6  8         3
3   2  6  5  2  8  4  9  3  9  5         4
4   9  7  5  1  3  2  1  8  3  4         4
# plot with Seaborn
import seaborn as sns
sns.lmplot(x='a', y='b', hue='centroid', data=df, scatter=True, fit_reg=False)
Here's a pure Numpy/Pyplot version, if you're restricted to those modules:
from matplotlib import pyplot as plt
fig, ax = plt.subplots()
colors = {0:'purple', 1:'red', 2:'blue', 3:'green', 4:'black'}
ax.scatter(x=data[:,0], y=data[:,1], c=[colors[x] for x in data[:,10]])
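If a legend is also wanted, one possible variant is to draw each cluster with its own scatter call. The sketch below is an illustration under assumptions: the function name `plot_clusters`, the `xcol`/`ycol` parameters, and the random test data are hypothetical, and columns 5 and 6 are taken to be the source and destination ports as in the question.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_clusters(X, idx, K, centroids, xcol=5, ycol=6):
    """Scatter two feature columns, one color and legend entry per cluster."""
    X = np.asarray(X)
    centroids = np.asarray(centroids)
    idx = np.asarray(idx)
    fig, ax = plt.subplots()
    for k in range(K):
        members = idx == k                         # boolean mask of rows in cluster k
        color = plt.cm.viridis(k / max(K - 1, 1))  # one RGBA tuple per cluster
        ax.scatter(X[members, xcol], X[members, ycol],
                   color=color, marker='x', label=f'cluster {k}')
        # Draw the centroid in the same color, outlined so it stands out.
        ax.scatter(centroids[k, xcol], centroids[k, ycol],
                   color=color, marker='o', s=160, edgecolors='black')
    ax.set_xlabel('Source port')
    ax.set_ylabel('Destination port')
    ax.legend()
    return fig, ax

# Hypothetical stand-in data: 50 rows of 10 features, 3 clusters.
rng = np.random.default_rng(1)
K = 3
X = rng.random((50, 10)) * 10
idx = rng.integers(0, K, size=50)
centroids = rng.random((K, 10)) * 10

fig, ax = plot_clusters(X, idx, K, centroids)
```

Calling `plt.show()` afterwards displays the figure. Looping per cluster costs K scatter calls instead of one, but each cluster gets a labeled legend entry without building a color dictionary by hand.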
Upvotes: 2