user1643523

Reputation: 273

Python: loading a kmeans training dataset and using it to predict a new dataset

I have a large amount of data on which I would like to run k-means clustering. The dataset is so big that I cannot load all the files into memory.

My idea is to fit the clustering on one part of the dataset, treated as a training set, and then apply the fitted model to the rest of the dataset part by part.

import pandas as pd
import pickle
from sklearn.cluster import KMeans

frames = [pd.read_hdf(fin) for fin in ifiles]
data = pd.concat(frames, ignore_index=True, axis=0)
data.dropna(inplace=True)

k = 12
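# pd.concat expects a list of objects as its first argument
x = pd.concat([data['A'], data['B'], data['C']], axis=1, keys=['A','B','C'])
# Note: n_jobs was removed from KMeans in scikit-learn 1.0; drop it on newer versions
model = KMeans(n_clusters=k, random_state=0, n_jobs = -2)
model.fit(x)

# Save the fitted model; the with-block closes the file handle
with open(filename, 'wb') as f:
    pickle.dump(model, f)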

x looks like this:

array([[-2.26732099,  0.24895614,  2.34840191],
   [-2.26732099,  0.22270912,  1.88942378],
   [-1.99246557,  0.04154312,  2.63458941],
   ..., 
   [-4.29596287,  1.97036309, -0.22767511],
   [-4.26055474,  1.72347591, -0.18185197],
   [-4.15980382,  1.73176239, -0.30781225]])

The model looks like this:

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
n_clusters=12, n_init=10, n_jobs=-2, precompute_distances='auto',
random_state=0, tol=0.0001, verbose=0)

A plot of two of the features, color-coded by the cluster labels from the model, looks like this: [image]

Now I want to load the model and use it for prediction. As a test, I loaded the same data (not shown here) and tried to predict its cluster labels.

modelnew = pickle.load(open('test.pkl', 'rb'))
modelnew.predict(x)

The result: [image]

The predicted labels clearly do not form clusters. What am I missing? Do I need to fix the model parameters in some way?

I have also tried to build an example with a train and a test data set. It goes wrong here as well, so there is clearly something I am missing:

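import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

## Splitting data into a train and a test data set
sample_train, sample_test = train_test_split(x, test_size=0.50)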

k = 12 ## Setting number of clusters
model = KMeans(n_clusters=k, random_state=0, n_jobs = -2) ## Kmeans model
train = model.fit(sample_train) ## Fitting the training data
model.predict(sample_test) # Predicting the test data

centroids =  model.cluster_centers_
labels = model.labels_

## Figures
cmap_model = np.array(['red', 'lime', 'black', 'green', 'orange', 'blue', 'gray', 'magenta', 'cyan', 'purple', 'pink', 'lightblue', 'brown', 'yellow'])
plt.figure()
plt.scatter(sample_train[:,0], sample_train[:,1], c=cmap_model[train.labels_], s=10, edgecolors='none')
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model,  marker = "x", s=150, linewidths = 5, zorder = 10)

plt.figure()
plt.scatter(sample_test[:,0], sample_test[:,1], c=cmap_model[labels], s=10, edgecolors='none')
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model,  marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()

Train data: [image: train result]

Test data: [image: test result]

Upvotes: 10

Views: 9851

Answers (3)

Pravesh Sharma

Reputation: 1

Use X_test = X_test[X_train.columns] to fix the issue. This puts the test columns in the same order as the training columns.
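A minimal sketch of why column order can matter, assuming the data lives in pandas DataFrames. The frames, column offsets and cluster count here are made up for illustration and are not taken from the question. NumPy arrays are passed to the model so that it sees only column positions, as with the array shown in the question.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical training frame with columns A, B, C on different offsets
X_train = pd.DataFrame(
    rng.normal(size=(300, 3)) + [0.0, 10.0, -10.0],
    columns=["A", "B", "C"],
)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_train.to_numpy())

# The same rows, but stored with the columns in a different order
X_test = X_train[["C", "A", "B"]]

# Reorder the test columns to match the training columns before predicting
X_test_aligned = X_test[X_train.columns]

labels_expected = model.predict(X_train.to_numpy())
labels_aligned = model.predict(X_test_aligned.to_numpy())
labels_shuffled = model.predict(X_test.to_numpy())

print("aligned matches: ", np.array_equal(labels_aligned, labels_expected))
print("shuffled matches:", np.array_equal(labels_shuffled, labels_expected))
```

Whether this applies to the question depends on how the second dataset was assembled, which the post does not show.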

Upvotes: 0

New Bie

Reputation: 11

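You may need to change the line labels = model.labels_ to labels = model.predict(sample_test). model.labels_ holds the labels of the samples the model was fitted on, so using it to color the test scatter plot pairs each test point with the label of an unrelated training point.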
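A small sketch of the difference, using synthetic blobs as a hypothetical stand-in for x (the blob centers and sizes are assumptions for illustration only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical stand-in for x: three well-separated blobs in 3-D
blob_centers = np.array([[0, 0, 0], [5, 5, 5], [-5, 5, 0]], dtype=float)
x = np.vstack([c + rng.normal(scale=0.5, size=(100, 3)) for c in blob_centers])

sample_train, sample_test = train_test_split(x, test_size=0.5, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample_train)

train_labels = model.labels_              # one label per row of sample_train
test_labels = model.predict(sample_test)  # one label per row of sample_test

# Coloring the test points with train_labels pairs unrelated rows,
# so many points end up with the wrong color.
mismatch = np.mean(train_labels[: len(test_labels)] != test_labels)
print("fraction of test points that would be mis-colored:", mismatch)
```

In the plotting code from the question, this would mean passing cmap_model[test_labels] to the second scatter call instead of cmap_model[labels].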

Upvotes: 0

ypnos

Reputation: 52377

What k-means does is minimize the sum of squared distances between sample points and their corresponding cluster centers. A sample point is assigned to a cluster solely on the basis of its distance to the cluster centers: it goes to the nearest one.

This means that once you have found a set of cluster centers, there is not much that can go wrong in the prediction step. The output you are showing indicates that predict is not working the way it should at all.
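The assignment rule can be checked by hand. The sketch below, on random data chosen only for illustration, computes the nearest center for every point with NumPy and compares the result to predict:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # hypothetical data, not the asker's
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Squared Euclidean distance from every point to every center,
# shape (n_samples, n_clusters)
diff = X[:, None, :] - model.cluster_centers_[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Each point belongs to the center it is closest to
manual_labels = sq_dist.argmin(axis=1)
predicted = model.predict(X)

agreement = np.mean(manual_labels == predicted)
print("agreement between manual assignment and predict:", agreement)
```

If the two disagree on real data, the input passed to predict probably differs in format from the data used for fitting.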

Did you try the same without saving and loading the model object in between? Did you make sure the data in the reduced set and in the full set have exactly the same format?

The only drawback I see in your idea of learning cluster centers on a reduced sample set is that the sample set must be representative of the whole data. In the worst case, a large region of sample points would not be covered by the training set, and all of those points would be assigned to the closest cluster center even though it is far off. Even then, the result would certainly not look random as in your example.
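One way to run the first check is sketched below. It assumes nothing about your files: the data is random and the file name model.pkl is hypothetical. A pickled model should come back with identical centers and identical predictions.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))  # hypothetical data
model = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        modelnew = pickle.load(f)

same_centers = np.array_equal(model.cluster_centers_, modelnew.cluster_centers_)
same_labels = np.array_equal(model.predict(X), modelnew.predict(X))

# The number of columns must match what the model was fitted on
same_width = X.shape[1] == modelnew.cluster_centers_.shape[1]

print(same_centers, same_labels, same_width)
```

If all three checks pass on your data, saving and loading is not the cause and the problem lies elsewhere, for example in how the labels are used afterwards.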
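A rough sketch of the part-by-part idea, with a simple coverage check added. Everything here is an assumption for illustration: the random array stands in for the full data, np.array_split stands in for reading one file at a time, and the yardstick train_radius is just one possible heuristic.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
full = rng.normal(size=(10_000, 3))  # hypothetical stand-in for the full data

# Fit on a random subset only
train_idx = rng.choice(len(full), size=1_000, replace=False)
model = KMeans(n_clusters=12, n_init=10, random_state=0).fit(full[train_idx])

# Largest distance from a training point to its nearest center,
# used as a rough yardstick for "covered by the training set"
train_radius = model.transform(full[train_idx]).min(axis=1).max()

labels = []
n_far = 0
for chunk in np.array_split(full, 20):  # in practice: one file at a time
    labels.append(model.predict(chunk))
    nearest = model.transform(chunk).min(axis=1)  # distance to closest center
    n_far += int((nearest > train_radius).sum())
labels = np.concatenate(labels)

print("points labeled:", labels.shape[0])
print("points farther from any center than every training point:", n_far)
```

A large n_far relative to the data size would suggest that the training subset was not representative.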

Upvotes: 0
