Reputation: 117
I'm working on an NLP task and need to calculate the co-occurrence matrix over documents. The basic formulation is as follows:
I have a matrix with shape (n, length), where each row represents a sentence composed of length words, so there are n sentences in all, each of the same length. Then, with a defined context size, e.g., window_size = 5, I want to calculate the co-occurrence matrix D, where the entry in the cth row and wth column is #(w,c), which means the number of times that a context word c appears in w's context.
An example can be found here: How to calculate the co-occurrence between two words in a window of text?
I know it can be calculated with nested loops, but I want to know whether there exists a simpler way or a ready-made function. I have found some answers, but they do not work with a window sliding through the sentence. For example: word-word co-occurrence matrix
So could anyone tell me whether there is a function in Python that handles this problem concisely? I think this task is quite common in NLP.
Upvotes: 9
Views: 23424
Reputation: 1
I have calculated the co-occurrence matrix with window size = 2.
First, write a function that returns the correct neighbourhood words (here I have used GetContext).
Then create the matrix and add 1 whenever a particular word is present in the neighbourhood.
Here is the Python code:
    import numpy as np

    CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
    top2000 = ["abc", "pqr", "def"]  # list(set((' '.join(ctxs)).split(' ')))
    a = np.zeros((3, 3), np.int32)
    for sentence in CORPUS:
        for index, word in enumerate(sentence.split(' ')):
            if word in top2000:
                print(word)
                context = GetContext(sentence, index)
                print(context)
                for word2 in context:
                    if word2 in top2000:
                        a[top2000.index(word)][top2000.index(word2)] += 1
    print(a)
The GetContext function (define it before running the loop above):
    def GetContext(sentence, index):
        # Return up to two words on each side of position `index`,
        # clipped at the start and end of the sentence. Slicing replaces
        # the per-position branches and also handles very short sentences.
        words = sentence.split(' ')
        return words[max(index - 2, 0):index] + words[index + 1:index + 3]
Here is the result:
    array([[0, 3, 3],
           [3, 0, 2],
           [3, 2, 0]])
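The same counts can be obtained for any window size by slicing the neighbours out of the word list. The following is only a minimal sketch, not part of the code above: `cooccurrence_matrix` and its parameters are hypothetical names, and it uses plain nested lists instead of NumPy so that it runs without dependencies.

```python
def cooccurrence_matrix(corpus, vocab, window=2):
    """Count how often each vocab word occurs within `window` positions of another.

    Hypothetical helper: rows are the centre word, columns the context word,
    both ordered as in `vocab`.
    """
    index = {word: i for i, word in enumerate(vocab)}
    counts = [[0] * len(vocab) for _ in vocab]
    for sentence in corpus:
        words = sentence.split()
        for pos, word in enumerate(words):
            if word not in index:
                continue
            # Up to `window` words on each side, clipped at the sentence edges.
            context = words[max(pos - window, 0):pos] + words[pos + 1:pos + 1 + window]
            for neighbour in context:
                if neighbour in index:
                    counts[index[word]][index[neighbour]] += 1
    return counts


CORPUS = ["abc def ijk pqr", "pqr klm opq", "lmn pqr xyz abc def pqr abc"]
VOCAB = ["abc", "pqr", "def"]
matrix = cooccurrence_matrix(CORPUS, VOCAB, window=2)
print(matrix)  # [[0, 3, 3], [3, 0, 2], [3, 2, 0]]
```

With window=2 this reproduces the matrix shown above; passing the nested list to np.array gives a NumPy array if one is needed.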
Upvotes: 0
Reputation: 823
It is not that complicated, I think. Why not write a function yourself? First get the matrix X by following this tutorial: http://scikit-learn.org/stable/modules/feature_extraction.html#common-vectorizer-usage (the code below treats each row of X as one sentence of integer word indices). Then, for each sentence, calculate the co-occurrences and add them to a running total.
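    import numpy as np
    window = 5  # context size, e.g. the question's window_size; counted on each side of the word
    n = int(np.max(X)) + 1  # n is the count of all words; assumes X holds word indices 0..n-1
    m = np.zeros([n, n], dtype=np.int64)
    def cal_occ(sentence, m):
        for i, word in enumerate(sentence):
            # positions within `window` of i, clipped to the sentence, skipping i itself
            for j in range(max(i - window, 0), min(i + window + 1, len(sentence))):
                if j != i:
                    m[word, sentence[j]] += 1
    for sentence in X:
        cal_occ(sentence, m)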
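If the vocabulary is large, a dense matrix may not fit in memory. One possible alternative, sketched here under assumptions, is to accumulate the same counts in a collections.Counter keyed by (word, context word) pairs, which stores only the pairs that actually occur. The name `count_pairs` and the toy data are hypothetical, and the window is assumed to be counted on each side of the centre word.

```python
from collections import Counter


def count_pairs(sentences, window):
    """Return a Counter mapping (word, context_word) -> co-occurrence count.

    `sentences` is assumed to be an iterable of sequences of word ids;
    `window` is the number of positions counted on each side.
    """
    pairs = Counter()
    for sentence in sentences:
        for i, word in enumerate(sentence):
            lo = max(i - window, 0)
            hi = min(i + window + 1, len(sentence))
            for j in range(lo, hi):
                if j != i:  # a position is not its own context
                    pairs[(word, sentence[j])] += 1
    return pairs


# Toy data: 2 sentences of length 4, word ids in 0..3.
X = [[0, 1, 2, 3],
     [1, 0, 1, 2]]
pairs = count_pairs(X, window=1)
print(pairs[(0, 1)], pairs[(1, 2)], pairs[(0, 3)])  # 3 2 0
```

Looking up a pair that never occurred returns 0, so the Counter can be read like a sparse version of the matrix m.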
Upvotes: 10