Reputation: 1353
I am currently clustering a data set. Is there a way to save the resulting groups so that, in the future, I can take new data and find out which group each point belongs to according to the k-means "model" I built?
I have learned to work with KMeans and find it very interesting, but right now, whenever I want to know which group a new data point belongs to, I repeat the whole analysis. What I would like instead is to use the old data (we could call it training data) to determine the group of a new data point. Is that possible?
This is my code right now.
from sklearn.cluster import KMeans

n_clusters = 15
kmeans = KMeans(n_clusters=n_clusters, init='k-means++', max_iter=3000, n_init=100, random_state=0)
# Fit the model and get the cluster label of every row in data
y_kmeans = kmeans.fit_predict(data)
data_df['k-means'] = y_kmeans
When I plot my current results, the clusters already cover the entire range of the data. Therefore, any new data point must belong to one of the current groups.
import matplotlib.pyplot as plt

# Visualising the clusters
colors = ['blue', 'orange', 'green', 'red', 'yellow', 'cyan', 'brown', 'cadetblue', 'gray',
          'salmon', 'olive', 'deeppink', 'pink', 'gold', 'lime']
for i in range(n_clusters):
    plt.scatter(data[y_kmeans == i, 0], data[y_kmeans == i, 1], color=colors[i])
# Plotting the centroids of the clusters
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], label='Centroids')
plt.legend()
Obviously, as new data comes in, I will also re-examine the data for variations.
Thank you very much.
Upvotes: 0
Views: 1587
Reputation: 52317
You can simply keep the cluster centers and assign each new data point to the nearest cluster (i.e., the one whose center is at the smallest Euclidean distance).
This is what the prediction step in k-means does.
The cluster centers are available on the fitted model as kmeans.cluster_centers_ (not on y_kmeans, which is only the array of labels).
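To make this concrete, here is a minimal sketch of saving the fitted model and reusing it on new points. It assumes scikit-learn and NumPy are installed, uses synthetic 2-D data in place of the asker's data, and picks a smaller n_clusters and n_init just to run quickly. The file name kmeans_model.pkl is hypothetical, and pickle is only one possible way to store the model.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the "training" data (assumption: 2-D numeric features).
rng = np.random.default_rng(0)
data = rng.normal(size=(300, 2))

# Fit once; parameters are reduced compared to the question to keep the sketch fast.
kmeans = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=0)
kmeans.fit(data)

# Persist the fitted model so the clustering does not have to be repeated.
model_path = Path(tempfile.gettempdir()) / "kmeans_model.pkl"  # hypothetical file name
with open(model_path, "wb") as f:
    pickle.dump(kmeans, f)

# Later (e.g. in another session): load the model and label new points.
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

new_data = rng.normal(size=(10, 2))
new_labels = loaded.predict(new_data)

# The same assignment by hand: index of the nearest centroid (Euclidean distance).
distances = np.linalg.norm(
    new_data[:, None, :] - loaded.cluster_centers_[None, :, :], axis=2
)
manual_labels = distances.argmin(axis=1)
```

Here new_labels and manual_labels should agree, since predict assigns each point to its closest center. If you only need the centers, storing kmeans.cluster_centers_ (for example with np.save) and using the manual step is enough. A pickled model is generally best loaded with the same scikit-learn version that created it.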
Upvotes: 1