Reputation: 103
I am using Python's scikit-learn for document clustering, and I have a sparse matrix stored in a dict object:
For example:
doc_term_dict = {('d1', 't1'): 12,
                 ('d2', 't3'): 10,
                 ('d3', 't2'): 5}  # from mysql data table
<type 'dict'>
I want to use scikit-learn to do the clustering, where the input matrix type is scipy.sparse.csr.csr_matrix
Example:
(0, 2164) 0.245793088885
(0, 2076) 0.205702177467
(0, 2037) 0.193810934784
(0, 2005) 0.14547028437
(0, 1953) 0.153720023365
...
<class 'scipy.sparse.csr.csr_matrix'>
I can't find a way to convert the dict to this CSR matrix (I have never used scipy).
Upvotes: 5
Views: 4959
Reputation: 21
An alternative approach that uses np.fromiter instead of a list to store the elements.
from scipy.sparse import csr_matrix
import numpy as np

def _dict_to_csr(term_dict, shape=None):
    # Assumes keys are (row, col) tuples of integers, not strings like 'd1'.
    data = np.fromiter(term_dict.values(), dtype=np.float32)
    rows_tuple, columns_tuple = zip(*term_dict.keys())
    rows = np.fromiter(rows_tuple, dtype=int)
    columns = np.fromiter(columns_tuple, dtype=int)
    return csr_matrix((data, (rows, columns)), shape=shape)
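A quick usage sketch, assuming the keys have already been converted to zero-based integer (row, column) pairs; the string keys from the question, such as ('d1', 't1'), would need to be mapped to integers first. The sample dictionary and the shape are made up for illustration, and the function body is repeated only so the snippet runs on its own.

```python
import numpy as np
from scipy.sparse import csr_matrix


def _dict_to_csr(term_dict, shape=None):
    # Same logic as the function above, repeated so this snippet is self-contained.
    data = np.fromiter(term_dict.values(), dtype=np.float32)
    rows_tuple, columns_tuple = zip(*term_dict.keys())
    rows = np.fromiter(rows_tuple, dtype=int)
    columns = np.fromiter(columns_tuple, dtype=int)
    return csr_matrix((data, (rows, columns)), shape=shape)


# Hypothetical data: zero-based integer (row, column) keys.
int_keyed = {(0, 0): 12, (1, 2): 10, (2, 1): 5}

# Passing shape explicitly keeps trailing empty rows or columns;
# a fourth document with no terms is assumed here.
m = _dict_to_csr(int_keyed, shape=(4, 3))
dense = m.toarray()
```

Without `shape`, scipy infers the dimensions from the largest row and column index, so an empty last document would be dropped silently.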
Upvotes: 0
Reputation: 2810
Same as @carsonc's answer, but for Python 3.x:
from scipy.sparse import csr_matrix

def _dict_to_csr(term_dict):
    # Assumes keys are (row, col) tuples of non-negative integers.
    rows, cols = zip(*term_dict.keys())
    # The shape must cover the largest index, not the number of stored entries.
    shape = (max(rows) + 1, max(cols) + 1)
    csr = csr_matrix((list(term_dict.values()), (rows, cols)), shape=shape)
    return csr
Upvotes: 0
Reputation: 85
We can make @Unapiedra's (excellent) answer a little more sparse:
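from itertools import repeat

import numpy as np
from scipy.sparse import csr_matrix

def _dict_to_csr(term_dict):
    # Python 2 code; assumes keys are (row, col) tuples of non-negative integers.
    term_dict_v = list(term_dict.itervalues())
    term_dict_k = list(term_dict.iterkeys())
    # Square shape sized by the largest index found in either dimension.
    shape = tuple(repeat(np.asarray(term_dict_k).max() + 1, 2))
    csr = csr_matrix((term_dict_v, zip(*term_dict_k)), shape=shape)
    return csr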
Upvotes: 2
Reputation: 16197
Pretty straightforward. First read the dictionary and convert the keys to the appropriate row and column. Scipy supports (and recommends for this purpose) the COO-rdinate format for sparse matrices.
Pass it data, row, and column, where A[row[k], column[k]] = data[k] (for all k) defines the matrix. Then let Scipy do the conversion to CSR.
Please check that I have the rows and columns the way you want them; I might have them transposed. I also assumed that the input would be 1-indexed.
My code below prints:
(0, 0) 12
(1, 2) 10
(2, 1) 5
Code:
#!/usr/bin/env python3
#http://stackoverflow.com/questions/26335059/converting-python-sparse-matrix-dict-to-scipy-sparse-matrix
from scipy.sparse import csr_matrix, coo_matrix

def convert(term_dict):
    ''' Convert a dictionary with elements of form ('d1', 't1'): 12 to a CSR type matrix.
    The element ('d1', 't1'): 12 becomes entry (0, 0) = 12.
    * Conversion from 1-indexed to 0-indexed.
    * d is row
    * t is column.
    '''
    # Create the appropriate format for the COO format.
    data = []
    row = []
    col = []
    for k, v in term_dict.items():
        r = int(k[0][1:])
        c = int(k[1][1:])
        data.append(v)
        row.append(r-1)
        col.append(c-1)
    # Create the COO-matrix
    coo = coo_matrix((data, (row, col)))
    # Let Scipy convert COO to CSR format and return
    return csr_matrix(coo)

if __name__ == '__main__':
    doc_term_dict = {('d1', 't1'): 12,
                     ('d2', 't3'): 10,
                     ('d3', 't2'): 5}
    print(convert(doc_term_dict))
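To follow up on the orientation caveat, one way to check the result is to look at it densely and, if terms should be rows instead, transpose it. The sketch below is a variant of the code above under stated assumptions: `to_csr` is a hypothetical name, the keys follow the 'd<N>' / 't<N>' pattern from the question, and an explicit `shape` is passed so that trailing documents or terms with no entries are not dropped.

```python
from scipy.sparse import coo_matrix


def to_csr(term_dict, shape=None):
    # Hypothetical helper: same 'd<N>' / 't<N>' parsing as convert() above.
    data, row, col = [], [], []
    for (d, t), v in term_dict.items():
        row.append(int(d[1:]) - 1)  # 'd1' -> row 0
        col.append(int(t[1:]) - 1)  # 't1' -> column 0
        data.append(v)
    # tocsr() performs the COO -> CSR conversion.
    return coo_matrix((data, (row, col)), shape=shape).tocsr()


doc_term_dict = {('d1', 't1'): 12, ('d2', 't3'): 10, ('d3', 't2'): 5}

# Documents as rows, terms as columns.
docs_by_terms = to_csr(doc_term_dict, shape=(3, 3))
dense = docs_by_terms.toarray()

# If terms should be rows instead, transpose and convert back to CSR.
terms_by_docs = docs_by_terms.T.tocsr()
```

Calling `toarray()` is only practical for small matrices like this one; for real document collections, inspecting `shape` and `nnz` is a cheaper sanity check.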
Upvotes: 5