Reputation: 767
I am trying to predict a cluster for a bunch of test documents in a trained k-means model using scikit-learn.
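from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
# Learn the TF-IDF vocabulary from the training documents and vectorize them
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(train_documents)
k = 10
# Cluster the TF-IDF vectors into k groups
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)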
The model fits without any problem and produces 10 clusters. But when I try to predict clusters for a list of test documents, I get an error.
predicted_cluster = model.predict(test_documents)
Error message:
ValueError: could not convert string to float...
Do I need to use PCA to reduce the number of features, or do I need to preprocess the text documents?
Upvotes: 0
Views: 753
Reputation: 36599
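You need to transform test_documents the same way the training documents were transformed. model.predict() expects the same numeric TF-IDF features the model was fitted on, not raw strings, which is why you get the ValueError.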
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)
Make sure you only call transform() on the test documents, and use the same vectorizer object that was used for fit() or fit_transform() on the training documents.
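To see the whole flow end to end, here is a minimal sketch. The toy documents, k = 2 and random_state=0 are made up for illustration and are not from the question; the point is only that the test matrix ends up with the same number of columns as the training matrix because the same fitted vectorizer is reused.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, only for illustration
train_documents = [
    "the cat sat on the mat",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "investors worry about rising interest rates",
]
test_documents = [
    "my cat chased the dog",
    "interest rates hit the stock market",
]

# Fit the vectorizer on the training documents only
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_documents)

# Small k because the toy corpus is tiny; random_state makes the run repeatable
model = KMeans(n_clusters=2, init="k-means++", max_iter=100, n_init=1, random_state=0)
model.fit(X)

# Reuse the fitted vectorizer: transform(), not fit_transform()
X_test = vectorizer.transform(test_documents)
predicted_cluster = model.predict(X_test)

# Both matrices share the vocabulary learned from the training documents
print(X.shape[1] == X_test.shape[1])
print(predicted_cluster)
```

If you called fit_transform() on the test documents instead, the vectorizer would learn a new vocabulary, and the resulting columns would no longer line up with the features the model was trained on.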
Upvotes: 1