Filippo Sebastio

Reputation: 1112

How to standardize a training and a test dataset through make_pipeline()

I am learning how to run a K-means model using make_pipeline to standardize the values of my dataset columns.

I am following a DataCamp course, but I am not clear why they fit and predict the model on the same dataset -in the Datacamp case "movements", a daily stock value dataset. I thought the whole purpose of K-means model was to be trained on a training dataset and to predict a test one?

Unlike the Datacamp case, I'd like to train my model on a column-standardized training dataset and to test it on a column-standardized testing dataset. How to do it? I am copying and pasting the Datacamp code below for reference.

# Import Normalizer, KMeans, and make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# Create a normalizer: normalizer
normalizer = Normalizer()

# Create a KMeans model with 5 clusters: kmeans
kmeans = KMeans(n_clusters=5)

# Make a pipeline chaining normalizer and kmeans: pipeline
pipeline = make_pipeline(normalizer, kmeans)

# Fit pipeline to the daily price movements
pipeline.fit(movements)

# Predict the cluster labels: labels
labels = pipeline.predict(movements)

Upvotes: 2

Views: 305

Answers (1)

Venkatachalam

Reputation: 16966

I think you are confusing KNN with the K-Means model. KNN is a supervised learning model used for both classification and regression, whereas K-Means is a clustering model, which comes under unsupervised learning (you don't have a target variable here!), where you don't usually do a train/test split.

If your intention is to measure the performance of K-Means, read here
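That said, a fitted K-Means pipeline can still assign cluster labels to new data via `predict()`, which is what the question asks for. A minimal sketch follows, using synthetic arrays (`X_train`, `X_test` are stand-ins for your own data). Note that for *column* standardization you would want `StandardScaler` rather than `Normalizer`, which rescales each *row* to unit norm:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))  # stand-in for your training columns
X_test = rng.normal(size=(20, 4))    # stand-in for your test columns

# StandardScaler standardizes each column (zero mean, unit variance)
pipeline = make_pipeline(StandardScaler(),
                         KMeans(n_clusters=5, n_init=10, random_state=0))

# Scaler statistics and centroids are learned from the training data only
pipeline.fit(X_train)

# Each test row is scaled with the training statistics, then assigned
# to its nearest learned centroid
labels = pipeline.predict(X_test)
print(labels.shape)  # (20,)
```

The pipeline guarantees that the scaler fitted on the training data is reused, unchanged, when transforming the test data, avoiding data leakage.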

Upvotes: 1

Related Questions