Reputation: 533
I am starting dealing with sparse matrices so I'm not really proficient on this topic. My problem is, I have a simple coo-occurrences matrix from a word list, just a 2-dimensional co-occurrence matrix word by word counting how many times a word occurs in same context. The matrix is quite sparse since the corpus is not that big. I want to convert it to a sparse matrix to be able to deal better with it, eventually do some matrix multiplication afterwards. Here what I have done until now (only the first part, the rest is just output format and cleaning data):
def matrix(from_corpus):
d = defaultdict(lambda : defaultdict(int))
heads = set()
trans = set()
for text in corpus:
d[text[0]][text[1]] += 1
heads.add(text[0])
trans.add(text[1])
return d,heads,trans
My idea would be to make a new function:
def matrix_to_sparse(d):
A = sparse.lil_matrix(d)
Does this make any sense? This is however not working and somehow I don't the way how get a sparse matrix. Should I better work with numpy arrays? What would be the best way to do this. I want to compare many ways to deal with matrices.
It would be nice if some could put me in the direction.
Upvotes: 4
Views: 3476
Reputation: 363807
Here's how you construct a document-term matrix A
from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency(*):
vocabulary = {} # map terms to column indices
data = [] # values (maybe weights)
row = [] # row (document) indices
col = [] # column (term) indices
for i, doc in enumerate(documents):
for term in doc:
# get column index, adding the term to the vocabulary if needed
j = vocabulary.setdefault(term, len(vocabulary))
data.append(1) # uniform weights
row.append(i)
col.append(j)
A = scipy.sparse.coo_matrix((data, (row, col)))
Now, to get a cooccurrence matrix:
A.T * A
(ignore the diagonal, which holds cooccurrences of term with themselves, i.e. squared frequency).
Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
Upvotes: 7