GEORGE GUO

Reputation: 117

Python - calculate the co-occurrence matrix

I'm working on an NLP task and I need to calculate the co-occurrence matrix over documents. The basic formulation is as follows:

I have a matrix of shape (n, length), where each row represents a sentence made up of length words, so there are n sentences in all, each of the same length. Given a context size, e.g. window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c): the number of times the context word c appears in w's context.
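One way to write this count down, assuming a symmetric window of window_size positions on each side of a word and writing x_{s,i} for the i-th word of sentence s (both the symmetric window and the notation are assumptions, not something fixed above):

```latex
\#(w, c) \;=\; \sum_{s=1}^{n} \; \sum_{i=1}^{\mathit{length}} \;
  \sum_{\substack{j = 1 \\ 0 < |i - j| \le \mathit{window\_size}}}^{\mathit{length}}
  \mathbf{1}[x_{s,i} = w] \, \mathbf{1}[x_{s,j} = c]
```

Under that reading, a word is never counted as its own context, and D is symmetric because every pair (i, j) inside a window is also visited as (j, i).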

An example can be found here: How to calculate the co-occurrence between two words in a window of text?

I know it can be calculated with nested loops, but I want to know whether there exists a simple way or a simple function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix

So could anyone tell me whether there is a function in Python that can deal with this problem concisely? I think this task is quite common in NLP.

Upvotes: 9

Views: 23424

Answers (2)

Shrinivas Ambiger

Reputation: 1

I have calculated the co-occurrence matrix with window size = 2.

  1. First, write a function that returns the correct neighbourhood words (here I have used GetContext).

  2. Create a matrix and add 1 whenever a particular word is present in the neighbourhood.

Here is the Python code:

import numpy as np

CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]

# Words to count; for the whole vocabulary use
# list(set(' '.join(CORPUS).split(' '))) instead.
top2000 = ["abc", "pqr", "def"]
a = np.zeros((len(top2000), len(top2000)), np.int32)

# GetContext is defined below; define it before running this loop.
for sentence in CORPUS:
    for index, word in enumerate(sentence.split(' ')):
        if word in top2000:
            print(word)
            context = GetContext(sentence, index)
            print(context)
            for word2 in context:
                if word2 in top2000:
                    # Row = center word, column = context word.
                    a[top2000.index(word)][top2000.index(word2)] += 1
print(a)

The GetContext function:

def GetContext(sentence, index, window=2):
    """Return up to `window` words on each side of words[index]."""
    words = sentence.split(' ')
    # Slicing clips at the sentence boundaries, so the first and last
    # positions need no special cases.
    left = words[max(index - window, 0):index]
    right = words[index + 1:index + 1 + window]
    return left + right

Here is the result:

array([[0, 3, 3],
       [3, 0, 2],
       [3, 2, 0]])
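The same approach can be sketched for an arbitrary window size. This is only a minimal sketch: the function name cooccurrence, its parameters, and the dictionary lookup (used in place of list.index) are illustrative choices, not part of the code above.

```python
import numpy as np

def cooccurrence(corpus, vocab, window=2):
    """Count how often each vocab word occurs within `window` words of another."""
    index = {word: i for i, word in enumerate(vocab)}  # word -> row/column
    counts = np.zeros((len(vocab), len(vocab)), dtype=np.int64)
    for sentence in corpus:
        words = sentence.split()
        for i, word in enumerate(words):
            if word not in index:
                continue
            # Up to `window` words on each side, excluding position i itself.
            context = words[max(i - window, 0):i] + words[i + 1:i + 1 + window]
            for other in context:
                if other in index:
                    counts[index[word], index[other]] += 1
    return counts

corpus = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
vocab = ["abc", "pqr", "def"]
matrix = cooccurrence(corpus, vocab, window=2)
print(matrix)
```

With window=2 this gives the same counts as the matrix shown above. The dictionary lookup avoids scanning the vocabulary list for every word, which matters once the vocabulary grows.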

Upvotes: 0

Zealseeker

Reputation: 823

It is not that complicated, I think. Why not write a function yourself? First build the matrix X by following this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage (the code below treats each row of X as one sentence, given as a sequence of integer word indices). Then, for each sentence, calculate the co-occurrences and add them to a running total.

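import numpy as np

# X: integer array of shape (n, length); each row is a sentence of word indices.
# vocab_size: number of distinct words; window: context words on each side.
m = np.zeros([vocab_size, vocab_size])
def cal_occ(sentence, m):
    for i, word in enumerate(sentence):
        # Positions i-window .. i+window, clipped to the sentence.
        for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
            if j != i:  # a word is not its own context
                m[word, sentence[j]] += 1
for sentence in X:
    cal_occ(sentence, m)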
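Because all sentences have the same length, the inner loops can also be replaced by array operations. The following is a rough sketch, not a tested recipe for large corpora: encode and cooccurrence are hypothetical helper names, the vocabulary is built with a plain dictionary rather than scikit-learn, and the window is assumed to be symmetric.

```python
import numpy as np

def encode(sentences):
    """Map equal-length, space-separated sentences to an (n, length) id array."""
    vocab = {}
    rows = [[vocab.setdefault(w, len(vocab)) for w in s.split()]
            for s in sentences]
    return np.array(rows), vocab

def cooccurrence(X, vocab_size, window):
    """Symmetric co-occurrence counts for a matrix of word ids."""
    m = np.zeros((vocab_size, vocab_size), dtype=np.int64)
    # Handle every distance d = 1..window with one pair of column slices.
    for d in range(1, min(window, X.shape[1] - 1) + 1):
        left, right = X[:, :-d], X[:, d:]  # words exactly d positions apart
        np.add.at(m, (left, right), 1)     # right word in the left word's context
        np.add.at(m, (right, left), 1)     # and the other way round
    return m

sentences = ["a b c a", "b a c c"]
X, vocab = encode(sentences)
m = cooccurrence(X, len(vocab), window=1)
print(vocab)
print(m)
```

np.add.at is used instead of m[left, right] += 1 because the plain indexed assignment would count a repeated index pair only once.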

Upvotes: 10
