ValueError while predicting a document in a scikit-learn k-means cluster

Question

I am trying to predict clusters for a set of test documents with a trained k-means model in scikit-learn.

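from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the raw training documents into a TF-IDF feature matrix
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)
k = 10
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)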

The model is fitted with 10 clusters without any problem. But when I try to predict clusters for a list of documents, I get an error.

predicted_cluster = model.predict(test_documents)

Error message:

ValueError: could not convert string to float...

Do I need to use PCA to reduce the number of features, or do I need to preprocess the text documents in some way?

Vivek Kumar · Accepted Answer

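You need to transform test_documents the same way the training documents were transformed. model.predict() expects a numeric feature matrix like the one the model was fitted on, not raw strings, which is why the conversion to float fails.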

X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

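Make sure you call only transform() on the test documents, and use the same vectorizer object that was fitted with fit() or fit_transform() on the training documents. Calling fit_transform() again would learn a new vocabulary, so the test features would no longer line up with the ones the model was trained on.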
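For reference, here is a minimal end-to-end sketch of the fix. The toy documents, n_clusters=2, and random_state=0 are assumptions made up for illustration and are not from the question; the rest mirrors the code above. It checks that the test matrix has the same number of columns as the training matrix, which is what lets predict() work.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about market prices",
]
test_documents = ["my cat chased the dog", "market prices rose"]

# Learn the vocabulary and IDF weights from the training documents only.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_documents)

# Small k and a fixed seed are assumptions to keep the toy example stable.
model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=1,
               random_state=0)
model.fit(X)

# transform() (not fit_transform()) reuses the training vocabulary, so
# X_test has the same columns as X; words unseen in training are ignored.
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

print(X.shape[1] == X_test.shape[1])  # same feature space for both matrices
print(predicted_cluster)              # one cluster index per test document
```

If you save the model for later use, save the fitted vectorizer together with it, since predictions are only meaningful with the vocabulary the model was trained on.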
