DavidR

Reputation: 41

word co-occurrence matrix from gensim

When building a Python gensim word2vec model, is there a way to see a doc-to-word matrix?

With input of sentences = [['first', 'sentence'], ['second', 'sentence']] I'd see something like*:

      first  second  sentence
doc0    1       0        1
doc1    0       1        1

*I've illustrated 'human readable', but I'm looking for a scipy (or other) matrix, indexed to model.wv.index2word.

And can that be transformed into a word-to-word matrix (to see co-occurrences)? Something like:

          first  second  sentence
first       1       0        1
second      0       1        1  
sentence    1       1        2   

I've already implemented something like a word-word co-occurrence matrix using CountVectorizer, and it works well. However, I'm already using gensim in my pipeline, and speed and code simplicity matter for my use case.

Upvotes: 4

Views: 4641

Answers (2)

DavidR

Reputation: 41

The doc-word to word-word transform turns out to be more complex (for me at least) than I'd originally supposed. np.dot() is the key to the solution, but I need to apply a mask first. I've created a more complex example for testing:

Imagine a doc-word matrix

#       word1  word2  word3
# doc0    3      4      2
# doc1    6      1      0
# doc2    8      0      4
  • in docs where word2 occurs, word1 occurs 9 times
  • in docs where word2 occurs, word2 occurs 5 times
  • in docs where word2 occurs, word3 occurs 2 times

So, when we're done we should end up with something like the below (or its transpose). Reading down the columns (each column is the word the docs are selected by), the word-word matrix becomes:

#       word1  word2  word3
# word1  17      9     11
# word2   5      5      4
# word3   6      2      6

A straight np.dot() product yields:

import numpy as np
doc2word = np.array([[3,4,2],[6,1,0],[8,0,4]])
np.dot(doc2word,doc2word.T)
# array([[29, 22, 32],
#        [22, 37, 48],
#        [32, 48, 80]])

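which would imply that word1 occurs with itself 29 times, clearly not what I want. (With documents as the rows of doc2word, this product is really a doc-by-doc matrix: the 29 compares doc0 with itself.)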
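To make the shapes in the product above concrete, here is a minimal sketch assuming the same toy matrix with documents as rows. The names doc_by_doc and word_by_word are only illustrative. Note that the plain word-by-word product sums products of counts, so it still does not match the target table; that is why a mask is needed.

```python
import numpy as np

# Rows are documents, columns are words (same toy matrix as above).
doc2word = np.array([[3, 4, 2],
                     [6, 1, 0],
                     [8, 0, 4]])

# (n_docs x n_words) @ (n_words x n_docs) -> one row/column per document.
doc_by_doc = doc2word @ doc2word.T

# (n_words x n_docs) @ (n_docs x n_words) -> one row/column per word,
# but each entry is a sum of count * count, not a masked count.
word_by_word = doc2word.T @ doc2word

print(doc_by_doc)
print(word_by_word)
```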

But if, instead of multiplying doc2word by itself, I first build a boolean mask, I get closer. I also need to change the arguments: the transposed count matrix goes first and the mask second:

import numpy as np
doc2word = np.array([[3,4,2],[6,1,0],[8,0,4]])
# a mask where all values greater than 0 are true
# so when this is multiplied by the orig matrix, True = 1 and False = 0
doc2word_mask = doc2word > 0  

np.dot(doc2word.T, doc2word_mask)
# array([[17,  9, 11],
#        [ 5,  5,  4],
#        [ 6,  2,  6]])

I've been thinking about this for too long....
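As a sanity check on the masked product, the sketch below compares it with a deliberately slow loop that follows the verbal definition ("in docs where word j occurs, how often does word i occur"). The helper cooccurrence_bruteforce is a hypothetical name introduced here for illustration only.

```python
import numpy as np

doc2word = np.array([[3, 4, 2],
                     [6, 1, 0],
                     [8, 0, 4]])

def cooccurrence_bruteforce(counts):
    """counts[d, w] is the number of times word w occurs in doc d.

    Returns M where M[i, j] is the total count of word i
    over all docs in which word j occurs at least once.
    """
    n_docs, n_words = counts.shape
    out = np.zeros((n_words, n_words), dtype=counts.dtype)
    for j in range(n_words):
        for d in range(n_docs):
            if counts[d, j] > 0:
                # doc d contains word j: add all of its word counts to column j
                out[:, j] += counts[d, :]
    return out

# Vectorized version: transposed counts times the 0/1 presence mask.
masked = doc2word.T @ (doc2word > 0).astype(doc2word.dtype)
brute = cooccurrence_bruteforce(doc2word)

print(masked)
print(np.array_equal(masked, brute))
```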

Upvotes: 0

Syncrossus

Reputation: 626

Given a corpus that is a list of lists of words, what you want to do is create a gensim Dictionary, convert your corpus to bag-of-words, and then create your matrix:

from gensim.matutils import corpus2csc
from gensim.corpora import Dictionary

# corpus: a list of tokenized documents, e.g. [['first', 'sentence'], ['second', 'sentence']]

dct = Dictionary(corpus)
bow_corpus = [dct.doc2bow(line) for line in corpus]
term_doc_mat = corpus2csc(bow_corpus)

Your term_doc_mat is a SciPy compressed sparse column (CSC) matrix with one row per term and one column per document. If you want a term-term matrix, you can always multiply it by its transpose, i.e.:

# use the sparse matrix's own product; np.dot is not reliable on SciPy sparse matrices
term_term_mat = term_doc_mat.dot(term_doc_mat.T)

Upvotes: 2
