El_Patrón

Reputation: 533

How to convert co-occurrence matrix to sparse matrix

I am just starting to work with sparse matrices, so I'm not really proficient in this topic. My problem is this: I have a simple co-occurrence matrix built from a word list, just a 2-dimensional word-by-word matrix counting how many times a word occurs in the same context. The matrix is quite sparse since the corpus is not that big. I want to convert it to a sparse matrix so that I can work with it more easily and eventually do some matrix multiplication. Here is what I have done so far (only the first part; the rest is just output formatting and data cleaning):

from collections import defaultdict

def matrix(corpus):
    # d[head][trans] counts how often the pair (head, trans) occurs
    d = defaultdict(lambda: defaultdict(int))
    heads = set()
    trans = set()
    for text in corpus:
        d[text[0]][text[1]] += 1
        heads.add(text[0])
        trans.add(text[1])

    return d, heads, trans

My idea would be to make a new function:

def matrix_to_sparse(d):
    A = sparse.lil_matrix(d)

Does this make any sense? However, this is not working, and I don't know how to get a sparse matrix. Would it be better to work with numpy arrays? What would be the best way to do this? I want to compare several ways of dealing with matrices.

It would be nice if someone could point me in the right direction.

Upvotes: 4

Views: 3476

Answers (1)

Fred Foo

Reputation: 363807

Here's how you construct a document-term matrix A from a set of documents in SciPy's COO format, which is a good tradeoff between ease of use and efficiency(*):

import scipy.sparse

vocabulary = {}  # map terms to column indices
data = []        # values (maybe weights)
row = []         # row (document) indices
col = []         # column (term) indices

for i, doc in enumerate(documents):
    for term in doc:
        # get column index, adding the term to the vocabulary if needed
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)  # uniform weights
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))
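To see what the loop above produces, here is a minimal sketch on a toy corpus. The `documents` list and its tokens are made up for illustration; the rest follows the same triplet construction. Note that repeated (row, column) pairs are summed when the COO matrix is converted, so the entries end up as per-document term counts.

```python
import scipy.sparse

# Hypothetical toy corpus: each document is a list of tokens.
documents = [["a", "b", "a"], ["b", "c"]]

vocabulary = {}
data, row, col = [], [], []
for i, doc in enumerate(documents):
    for term in doc:
        j = vocabulary.setdefault(term, len(vocabulary))
        data.append(1)
        row.append(i)
        col.append(j)

A = scipy.sparse.coo_matrix((data, (row, col)))

# Duplicate (row, col) entries are summed on conversion, so "a"
# occurring twice in document 0 becomes a count of 2.
dense = A.toarray()
print(vocabulary)
print(dense)
```

For real data you would normally skip `toarray()` and convert with `A.tocsr()` before doing arithmetic.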

Now, to get a cooccurrence matrix:

A.T * A

(Ignore the diagonal, which holds the cooccurrences of each term with itself, i.e. the sum over documents of its squared frequency.)
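If you would rather remove the diagonal than ignore it, one possible sketch is below. The small count matrix is a made-up example, and the `@` operator is used here instead of `*` because it means matrix multiplication for both SciPy's sparse matrix and sparse array classes (with the newer array classes, `*` is element-wise).

```python
import numpy as np
import scipy.sparse

# Hypothetical document-term counts: 2 documents, 3 terms.
A = scipy.sparse.csr_matrix(np.array([[2, 1, 0],
                                      [0, 1, 1]]))

C = (A.T @ A).tolil()  # term-by-term product; LIL makes in-place edits cheap
C.setdiag(0)           # drop each term's cooccurrence with itself
C = C.tocsr()
C.eliminate_zeros()    # remove any explicitly stored zeros

cooc = C.toarray()
print(cooc)
```

The result is symmetric, so entry (j, k) and entry (k, j) both hold the cooccurrence weight of terms j and k.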

Alternatively, use some package that does this kind of thing for you, such as Gensim or scikit-learn. (I'm a contributor to both projects, so this might not be unbiased advice.)
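If you already have the nested dict from the question, the same (data, (row, col)) triplet idea should carry over. The sketch below assumes the dict maps head words to counts of following words, as in the question's `matrix` function; the toy `corpus` and the `row_index`/`col_index` helpers are hypothetical names introduced only for illustration.

```python
from collections import defaultdict
import scipy.sparse

# Hypothetical (head, trans) pairs standing in for the question's corpus.
corpus = [("cat", "sat"), ("cat", "sat"), ("dog", "ran"), ("cat", "ran")]

d = defaultdict(lambda: defaultdict(int))
for head, trans_word in corpus:
    d[head][trans_word] += 1

# Fix an ordering so that every word maps to a stable index.
heads = sorted(d)
trans = sorted({t for counts in d.values() for t in counts})
row_index = {w: i for i, w in enumerate(heads)}
col_index = {w: j for j, w in enumerate(trans)}

data, row, col = [], [], []
for head, counts in d.items():
    for t, n in counts.items():
        data.append(n)
        row.append(row_index[head])
        col.append(col_index[t])

# Pass the shape explicitly so it does not depend on which pairs occur.
M = scipy.sparse.coo_matrix(
    (data, (row, col)), shape=(len(heads), len(trans))
).tocsr()
print(M.toarray())
```

Keeping `heads` and `trans` as sorted lists lets you map a row or column index back to its word later.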

Upvotes: 7
