I have a huge amount of data on which I would like to run a k-means clustering. The dataset is so big that I cannot load the files into memory.
My idea is to fit the clustering on a part of the dataset, used as a training dataset, and then apply the classification to the rest of the dataset part by part.
import pandas as pd
import pickle
from sklearn.cluster import KMeans

frames = [pd.read_hdf(fin) for fin in ifiles]  # ifiles is the list of input HDF5 files (definition not shown)
data = pd.concat(frames, ignore_index=True, axis=0)
data.dropna(inplace=True)

k = 12
x = pd.concat([data['A'], data['B'], data['C']], axis=1, keys=['A', 'B', 'C'])

model = KMeans(n_clusters=k, random_state=0, n_jobs=-2)
model.fit(x)
pickle.dump(model, open(filename, 'wb'))  # filename is e.g. 'test.pkl'
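The plan for the rest of the data would then be to predict part by part, roughly like this sketch (rest_of_files is a placeholder for the remaining input files, and chunked reading assumes the HDF5 stores were written in table format):

model = pickle.load(open('test.pkl', 'rb'))
for fin in rest_of_files:
    for chunk in pd.read_hdf(fin, chunksize=100000):
        chunk = chunk.dropna()
        labels = model.predict(chunk[['A', 'B', 'C']])
        # ... store or use the labels for this chunk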
x looks like this:
array([[-2.26732099,  0.24895614,  2.34840191],
       [-2.26732099,  0.22270912,  1.88942378],
       [-1.99246557,  0.04154312,  2.63458941],
       ...,
       [-4.29596287,  1.97036309, -0.22767511],
       [-4.26055474,  1.72347591, -0.18185197],
       [-4.15980382,  1.73176239, -0.30781225]])
The model looks like this:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=12, n_init=10, n_jobs=-2, precompute_distances='auto',
       random_state=0, tol=0.0001, verbose=0)
A plot of two of the features, color-coded by the fitted cluster labels, looks like this: (plot not shown)
Now I want to load the model and use it for prediction. As a test, I have loaded the same data again (not shown here) and tried to predict labels for this dataset.
modelnew = pickle.load(open('test.pkl', 'rb'))
modelnew.predict(x)
The predicted labels clearly do not follow the clusters (plot not shown). What am I missing? Do I need to fix the model parameters in some way?
I have also tried to make an example with a train and a test dataset, and it goes wrong there too. There is clearly something I am missing:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

## Splitting the data (the array x from above) into a test and a train dataset
sample_train, sample_test = train_test_split(x, test_size=0.50)

k = 12  ## Setting number of clusters
model = KMeans(n_clusters=k, random_state=0, n_jobs=-2)  ## KMeans model
train = model.fit(sample_train)  ## Fitting the training data
model.predict(sample_test)  ## Predicting the test data
centroids = model.cluster_centers_
labels = model.labels_  ## Labels of the training data

## Figures
cmap_model = np.array(['red', 'lime', 'black', 'green', 'orange', 'blue', 'gray',
                       'magenta', 'cyan', 'purple', 'pink', 'lightblue', 'brown', 'yellow'])

plt.figure()
plt.scatter(sample_train[:, 0], sample_train[:, 1], c=cmap_model[train.labels_], s=10, edgecolors='none')
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model[:k], marker='x', s=150, linewidths=5, zorder=10)

plt.figure()
plt.scatter(sample_test[:, 0], sample_test[:, 1], c=cmap_model[labels], s=10, edgecolors='none')
plt.scatter(centroids[:, 0], centroids[:, 1], c=cmap_model[:k], marker='x', s=150, linewidths=5, zorder=10)
plt.show()
You may need to change the line labels = model.labels_ to labels = model.predict(sample_test): labels_ holds the cluster labels of the training samples, not of sample_test.
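In context, the test plot then becomes (only the changed lines shown; the rest is your code as posted):

labels = model.predict(sample_test)  # assign each *test* point to its nearest centroid
plt.scatter(sample_test[:, 0], sample_test[:, 1], c=cmap_model[labels], s=10, edgecolors='none')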
What k-means does is minimize the sum of squared distances between sample points and their corresponding cluster centers. The association of a sample point with a cluster is based solely on its distance to the cluster centers.
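To make that concrete: predict is just a nearest-centroid lookup, which you can verify by hand (a sketch using the x array and fitted model from your question; agreement holds up to tie-breaking):

import numpy as np

# label(p) = argmin over j of ||p - center_j||
dists = np.linalg.norm(x[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
manual_labels = dists.argmin(axis=1)
assert np.array_equal(manual_labels, model.predict(x))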
This means that as soon as you found a set of cluster centers, there is not much that can go wrong in the prediction step. The output you are showing indicates that predict is not working the way it should at all.
Did you try the same without saving/loading the model object in between? Did you make sure the data in the reduced and in the full set has exactly the same format?
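A quick sanity check for the save/load step (a sketch, again assuming the fitted model and the array x from your question):

import numpy as np
import pickle

before = model.predict(x)  # predictions from the in-memory model
pickle.dump(model, open('test.pkl', 'wb'))
after = pickle.load(open('test.pkl', 'rb')).predict(x)  # predictions after the round trip
assert np.array_equal(before, after)  # pickling should not change the centroids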
The only drawback I see in your idea of learning cluster centers on a reduced sample set is that the sample must be representative of the whole data. In the worst case, a whole region of sample points is not covered by the training set, so all of its points get assigned to the closest cluster center, which may be far off. Even then, the result would not look random as in your example.
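A toy illustration of that failure mode (hypothetical numbers, not your data):

import numpy as np
from sklearn.cluster import KMeans

# Train only on points near the origin
rng = np.random.RandomState(0)
train = rng.normal(0, 1, size=(100, 2))
km = KMeans(n_clusters=2, random_state=0).fit(train)

# A far-away point is still assigned to the nearest centroid,
# even though no centroid describes it well
print(km.predict(np.array([[50.0, 50.0]])))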