Reputation: 3359
I'm going to be using sklearn to cluster data for a project at my company. For the first part I have to demonstrate that I am able to cluster data. In R this would be no problem for me, but R isn't so easy to use with HBase. I don't want to tarry, but the problem is that I don't know at what point the data points receive labels. Also, this is a 3D plot, so why does X (iris.data) have 4 numbers ([ 5.4 3.9 1.3 0.4]) per data point?
What I truly need out of this is to know which data point corresponds to which cluster. I don't need the visual.
Here's the code (pulled from here)
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.cluster import KMeans
from sklearn import datasets
np.random.seed(5)
centers = [[1, 1], [-1, -1], [1, -1]]
iris = datasets.load_iris()
X = iris.data
y = iris.target
estimators = {'k_means_iris_3': KMeans(n_clusters=3),
              'k_means_iris_8': KMeans(n_clusters=8),
              'k_means_iris_bad_init': KMeans(n_clusters=3, n_init=1,
                                              init='random')}
fignum = 1
for name, est in estimators.items():
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    est.fit(X)
    labels = est.labels_
    ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=labels.astype(float))
    ax.w_xaxis.set_ticklabels([])
    ax.w_yaxis.set_ticklabels([])
    ax.w_zaxis.set_ticklabels([])
    ax.set_xlabel('Petal width')
    ax.set_ylabel('Sepal length')
    ax.set_zlabel('Petal length')
    fignum = fignum + 1
# Plot the ground truth
fig = plt.figure(fignum, figsize=(4, 3))
plt.clf()
ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
plt.cla()
for name, label in [('Setosa', 0),
                    ('Versicolour', 1),
                    ('Virginica', 2)]:
    ax.text3D(X[y == label, 3].mean(),
              X[y == label, 0].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Reorder the labels to have colors matching the cluster results
y = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 3], X[:, 0], X[:, 2], c=y)
ax.w_xaxis.set_ticklabels([])
ax.w_yaxis.set_ticklabels([])
ax.w_zaxis.set_ticklabels([])
ax.set_xlabel('Petal width')
ax.set_ylabel('Sepal length')
ax.set_zlabel('Petal length')
plt.show()
Upvotes: 1
Views: 681
Reputation: 8207
Labels
Here is the result of adding two print statements to your code, which will show you when the labels are being generated.
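for name, est in estimators.items():
    print(est)  # show the estimator and its parameters
    fig = plt.figure(fignum, figsize=(4, 3))
    plt.clf()
    ax = Axes3D(fig, rect=[0, 0, .95, 1], elev=48, azim=134)
    plt.cla()
    est.fit(X)  # the cluster labels are computed here
    labels = est.labels_
    print(labels)  # one cluster index per row of X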
Printing est shows the parameters of the estimator that was used. As you can see, the first one has 8 clusters, which is reflected by the cluster assignments 0-7 in labels.
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=8, n_init=10,
n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
verbose=0)
[1 5 5 5 1 1 5 1 5 5 1 5 5 5 1 1 1 1 1 1 1 1 5 1 5 5 1 1 1 5 5 1 1 1 5 5 1
5 5 1 1 5 5 1 1 5 1 5 1 5 2 2 2 7 2 7 2 6 2 7 6 7 7 2 7 2 7 7 2 7 4 7 4 2
2 2 2 2 2 7 7 7 7 4 7 2 2 2 7 7 7 2 7 6 7 7 7 2 6 7 0 4 3 0 0 3 7 3 0 3 0
4 0 4 4 0 0 3 3 4 0 4 3 4 0 3 4 4 0 3 3 3 0 4 4 3 0 0 4 0 0 0 4 0 0 0 4 0
0 4]
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
verbose=0)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
1 2]
KMeans(copy_x=True, init='random', max_iter=300, n_clusters=3, n_init=1,
n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
verbose=0)
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2
2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 2 1 2 2 2 1 2
2 1]
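To make the timing explicit: labels_ is an attribute that KMeans only sets while fitting, and it holds one entry per row of X, in the same order as X. Here is a minimal sketch in Python 3 syntax, assuming a recent scikit-learn; the n_init=10 and random_state=0 arguments are my additions so the run is repeatable, they are not in your code.

```python
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# n_init and random_state are assumptions added for repeatability.
est = KMeans(n_clusters=3, n_init=10, random_state=0)

# Before fit() the estimator has no labels_ attribute yet.
has_labels_before = hasattr(est, "labels_")

est.fit(X)

# After fit(), labels_[i] is the cluster index assigned to X[i].
has_labels_after = hasattr(est, "labels_")
labels = est.labels_

print(has_labels_before, has_labels_after)
print(labels.shape)
```

So the data points "receive" their labels at the est.fit(X) line, and nothing in the plotting code is needed for that.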
Dimensions
The iris dataset has 4 dimensions (attributes), as you'll see if you look here. The one dimension you aren't plotting in this example is Sepal Width. You can see what each data point corresponds to by putting print(iris) in after iris = datasets.load_iris(). It prints out a lot of information, but the important part is at the bottom (not so pretty, by the way). It looks like this:
:Attribute Information:\n - sepal length in cm\n - sepal width in cm\n - petal length in cm\n - petal width in cm
The attributes correspond to X[flower][0], X[flower][1], X[flower][2], X[flower][3].
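If you don't want to dig through the full printout, the object returned by load_iris also has a feature_names list with one name per column. A small sketch in Python 3 syntax; the indices 3, 0 and 2 are simply the columns your scatter call uses.

```python
from sklearn import datasets

iris = datasets.load_iris()

# One name per column of iris.data, in column order.
for column, name in enumerate(iris.feature_names):
    print(column, name)

# The scatter call in the question plots columns 3, 0 and 2,
# so column 1 is the attribute that never reaches the 3D plot.
plotted = [iris.feature_names[i] for i in (3, 0, 2)]
unplotted = iris.feature_names[1]
print(plotted)
print(unplotted)
```

Note that KMeans still clusters on all 4 columns; only the plot drops one.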
Assignment
To see the cluster assignment for each data point, add this right below labels = est.labels_:
for flower in range(len(labels)):
    print((X[flower], labels[flower]))
This gives you the results below. It just shows one way to access each data point's cluster assignment; you probably don't care to print them and would rather store them somewhere meaningful.
(array([ 5.1, 3.5, 1.4, 0.2]), 1)
(array([ 4.9, 3. , 1.4, 0.2]), 5)
(array([ 4.7, 3.2, 1.3, 0.2]), 5)
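As one possible way to store the assignments instead of printing them, you could group the row indices by cluster. This is only a sketch in Python 3 syntax: the dictionary name members, n_init=10 and random_state=0 are my own choices for illustration, not part of the example.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans

X = datasets.load_iris().data

# n_init and random_state are assumptions added for repeatability.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_

# Map each cluster index to the row indices of X assigned to it.
members = {int(k): np.flatnonzero(labels == k) for k in np.unique(labels)}

for k, rows in members.items():
    print(k, len(rows), rows[:5])

# Cluster of a single data point, here the first flower.
first_flower_cluster = int(labels[0])
print(first_flower_cluster)
```

With this, X[members[k]] gives you all the data points in cluster k, and labels[i] gives you the cluster of data point i.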
Upvotes: 1