I'm working on a Python function (cluster_articles) to perform document clustering and return a dictionary of results. However, I'm encountering the following test errors:
TypeError: 'int' object is not iterable (in test_number_of_observations_kmeans10 and possibly test_proper_dict_return)
AssertionError: Assertion error at PCA explained value (in test_pca_explained)
import pickle
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import completeness_score, v_measure_score

def cluster_articles(data):
    # K-Means on original data
    kmeans_100 = KMeans(n_clusters=10, random_state=2, tol=0.05, max_iter=50)
    kmeans_100.fit(data['vectors'])
    labels_100 = kmeans_100.labels_

    # PCA Dimensionality Reduction
    pca = PCA(n_components=10, random_state=2)
    reduced_data = pca.fit_transform(data['vectors'])

    # K-Means on reduced data
    kmeans_10 = KMeans(n_clusters=10, random_state=2, tol=0.05, max_iter=50)
    kmeans_10.fit(reduced_data)
    labels_10 = kmeans_10.labels_

    print(type(kmeans_10.n_iter_))  # Debugging output

    # Results Dictionary (Potential issue here)
    result = {
        'nobs_100': kmeans_100.n_iter_,
        'nobs_10': kmeans_10.n_iter_,
        'pca_explained': pca.explained_variance_ratio_[0],
        # ... rest of the results
    }
    return result
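For comparison, here is a minimal sketch of one possible reading of the dictionary keys. It assumes that "nobs" means the number of observations per cluster (an iterable) rather than the iteration count, and that "pca_explained" could mean either the first component's ratio or the total over all components. Since the test code is unavailable, both readings are guesses, and the random matrix below is only a stand-in for data['vectors'].

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for data['vectors']; the real data is not shown in the post.
rng = np.random.RandomState(2)
vectors = rng.rand(200, 100)

kmeans = KMeans(n_clusters=10, random_state=2, tol=0.05, max_iter=50, n_init=10)
kmeans.fit(vectors)

# Assumption: "nobs" = observations per cluster, so count labels per cluster id.
nobs = np.bincount(kmeans.labels_, minlength=10).tolist()

pca = PCA(n_components=10, random_state=2)
pca.fit(vectors)

# Two candidate readings of "PCA explained" (which one the test wants is unknown).
first_component = float(pca.explained_variance_ratio_[0])
all_components = float(pca.explained_variance_ratio_.sum())
```

Under this reading, nobs is a list with one count per cluster, which can be iterated over, whereas n_iter_ is a single integer.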
Task and Data Description:
Goal: Cluster documents using K-Means (with and without PCA). Calculate metrics like completeness score, V-measure, and PCA explained variance.
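As a small illustration of the two metrics named in the goal, the sketch below calls completeness_score and v_measure_score on toy labels. The names true_labels and predicted_labels are hypothetical; which key of the data dictionary holds the true labels is not known here.

```python
from sklearn.metrics import completeness_score, v_measure_score

# Hypothetical toy labels, not taken from the real data set.
true_labels = [0, 0, 1, 1]
predicted_labels = [1, 1, 0, 0]  # same grouping, cluster ids swapped

# Both metrics ignore how the cluster ids are numbered.
completeness = completeness_score(true_labels, predicted_labels)
v_measure = v_measure_score(true_labels, predicted_labels)

# Putting everything into one cluster is complete but not homogeneous.
single_cluster = [0, 0, 0, 0]
completeness_single = completeness_score(true_labels, single_cluster)
v_measure_single = v_measure_score(true_labels, single_cluster)
```

Both functions take the true labels first and the predicted cluster labels second.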
Data Structure (data dictionary):
Relevant Packages:
scikit-learn (0.24.1)
NumPy (1.20.1)
SciPy (1.6.1)
pandas (1.2.3)
Questions:
What I've Tried: Printing the type of kmeans_10.n_iter_ confirms it's an integer.
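The error message itself can be reproduced with any plain integer. The sketch below is only an illustration: it assumes the test iterates over (for example, sums) the value stored under the nobs keys, which cannot be confirmed without the test code.

```python
n_iter = 7  # stand-in for kmeans_10.n_iter_, which is a plain int

# Iterating over an int fails with the same message the tests report.
try:
    sum(n_iter)
except TypeError as exc:
    message = str(exc)

# A list of per-cluster counts (hypothetical values) iterates without error.
cluster_sizes = [12, 30, 8]
total = sum(cluster_sizes)
```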
Additional Notes:
I don't have access to the test code.
There might be a file "subset_documents.p" which could be relevant.
Thank you for your help!