Reputation: 1112
I am learning how to run a K-means model using make_pipeline to standardize the values of my dataset's columns.
I am following a DataCamp course, but I am not clear on why they fit and predict the model on the same dataset (in the DataCamp case "movements", a daily stock price dataset). I thought the whole point of a K-means model was to be trained on a training dataset and then used to predict on a test one?
Unlike the DataCamp example, I'd like to train my model on a column-standardized training dataset and test it on a column-standardized testing dataset. How can I do that? I am copying and pasting the DataCamp code below for reference.
# Import Normalizer, KMeans and make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
# Create a normalizer: normalizer
normalizer = Normalizer()
# Create a KMeans model with 5 clusters: kmeans
kmeans = KMeans(n_clusters = 5)
# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)
# Fit pipeline to the daily price movements
pipeline.fit(movements)
# Predict the cluster labels: labels
labels = pipeline.predict(movements)
Upvotes: 2
Views: 305
Reputation: 16966
I think you are confusing KNN with the K-Means model. KNN is a supervised learning model used for both classification and regression, whereas K-Means is a clustering model, which falls under unsupervised learning (there is no target variable here!), where you don't usually do a train/test split.
If your intention is to measure the performance of K-Means, read here
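For instance, the silhouette score is one common way to gauge clustering quality when there are no ground-truth labels. Below is a minimal sketch on synthetic data (the random dataset, the seed, and `n_clusters=5` are illustrative assumptions, not values from the course):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Synthetic stand-in for the "movements" dataset: 100 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))

# Same pipeline shape as in the question: row-normalize, then cluster
pipeline = make_pipeline(Normalizer(), KMeans(n_clusters=5, n_init=10, random_state=0))
labels = pipeline.fit_predict(X)

# Score the clustering on the normalized data; silhouette ranges from -1 to 1,
# with higher values indicating better-separated clusters
X_norm = Normalizer().fit_transform(X)
score = silhouette_score(X_norm, labels)
print(score)
```

Note that the silhouette score is computed on the same data the model was fit on, which is the usual practice in clustering, in contrast to the supervised train/test workflow the asker describes.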
Upvotes: 1