Reputation: 64699
How do you cluster sparse data using sklearn's KMeans implementation?
Attempting to adapt their example for my own use case, I tried:
from sklearn.feature_extraction import DictVectorizer
from sklearn.cluster import KMeans
mydata = [
    (1, {'word1': 2, 'word3': 6, 'word7': 4}),
    (2, {'word11': 1, 'word7': 9, 'word3': 2}),
    (3, {'word5': 7, 'word1': 3, 'word9': 8}),
]
kmeans_data = []
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    freqs = dict((k, v/cnt_sum) for k, v in raw_data.items())
    v = DictVectorizer(sparse=True)
    X = v.fit_transform(freqs)
    kmeans_data.append(X)
kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)
but this throws the exception:
File "/myproject/.env/lib/python3.5/site-packages/sklearn/cluster/k_means_.py", line 854, in _check_fit_data
X = check_array(X, accept_sparse='csr', dtype=[np.float64, np.float32])
File "/myproject/.env/lib/python3.5/site-packages/sklearn/utils/validation.py", line 382, in check_array
array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: setting an array element with a sequence.
Presumably I'm not constructing my sparse input matrix X correctly: I'm passing a list of one-row sparse matrices instead of a single sparse matrix. How do I construct a proper input matrix?
Upvotes: 0
Views: 5367
Reputation: 17159
You are building the sparse matrix incrementally, and I am not sure DictVectorizer can be used that way. It would be simpler to add the elements to the matrix one by one. See the last example in the scipy.sparse.csr_matrix documentation.
Incremental construction
Consider the following double loop:
from scipy.sparse import csr_matrix

data = []
rows = []
cols = []
vocabulary = {}
for index, raw_data in mydata:
    cnt_sum = float(sum(raw_data.values()))
    for k, v in raw_data.items():
        f = v / cnt_sum                                # relative frequency of word k
        i = vocabulary.setdefault(k, len(vocabulary))  # column index for word k
        cols.append(i)
        rows.append(index - 1)                         # indices in mydata are 1-based
        data.append(f)

kmeans_data = csr_matrix((data, (rows, cols)))
Then kmeans_data is a sparse matrix suitable for use as input to KMeans.
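Reusing the parameters from your snippet, the clustering step then looks like this (a minimal sketch):

from sklearn.cluster import KMeans

# KMeans accepts scipy CSR matrices directly, so no densification is needed.
kmeans = KMeans(n_clusters=2, random_state=0).fit(kmeans_data)
print(kmeans.labels_)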
Direct construction
With DictVectorizer you can construct the data matrix directly from the list of tuples and then use sparse linear algebra routines to normalize the rows.
# 1. Construct the sparse matrix of occurrence counts
D = [d[1] for d in mydata]
v = DictVectorizer(sparse=True)
kmeans_data = v.fit_transform(D)
# 2. Normalize by computing the sum of each row and dividing by it
import numpy as np
sums = np.sum(kmeans_data, axis=1).A[:, 0]  # dense vector of row sums
N = len(sums)
divisor = csr_matrix((np.reciprocal(sums), (range(N), range(N))))  # diagonal matrix of 1/sum
kmeans_data = divisor * kmeans_data
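For what it's worth, the same L1 row normalization can also be done in one call with sklearn.preprocessing.normalize, which accepts sparse input (a sketch of an alternative, not part of the construction above):

from sklearn.preprocessing import normalize

# Divide each row by its L1 norm (the sum of counts, since all entries are
# non-negative); returns a sparse matrix, replacing the manual diagonal divisor.
kmeans_data = normalize(v.fit_transform(D), norm='l1', axis=1)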
Upvotes: 1