Reputation: 22336
The implementation did not consider multiple occurrences of a same word, and self word occurrences.
For instance when stride=2 and the word at the position is W, co-occurrence of X needs +2, self-co-occurrence of W needs +1.
X|Y|W|X|W
To update the m * m
matrix (co_occurance_matrix), currently accessing row by row with a loop. The entire code is at the bottom.
How can I remove the loop and update the multiple rows all at once? I believe there should be a way to combine each index into one matrix that replaces the loop with one vectorized update.
Please advice possible approaches.
for position in range(0, n):
co_ccurrence_matrix[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
] += 1
sequence
(word index is an integer code for each word).position
in the loop, check the co-occurring words on both sides within a stride
distance. context
window as in the purple box in the diagram. N = context_size = stride*2 + 1
.co_occurrence_matrix
as per blue lines in the diagram.It seems the Integer array indexing may be a way to access multiple rows at the same time.
x = np.array([[ 0, 1, 2],
[ 3, 4, 5],
[ 6, 7, 8],
[ 9, 10, 11]])
rows = np.array([[0, 0],
[3, 3]], dtype=np.intp)
columns = np.array([[0, 2],
[0, 2]], dtype=np.intp)
x[rows, columns]
---
array([[ 0, 2],
[ 9, 11]])
Create a multi-dimensional indices by combining each index in the loop, but it does not work with the error. Please advise the cause and the mistakes, or if the attempt does not make sense.
indices = np.array([
[
sequence[0], # position to the word
sequence[max(0, 0-stride) : min((0+stride),n-1) +1] # positions to co-occurrence words
]]
)
assert n > 1
for position in range(1, n):
co_occurrence_indices = np.array([
[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
]]
)
indices = np.append(
indices,
co_occurrence_indices,
axis=0
)
print("Updating the co_occurrence_matrix: indices \n{} \nindices.dtype {}".format(
indices,
indices.dtype
))
co_ccurrence_matrix[
indices <---- Error
] += 1
Updating the co_occurrence_matrix: indices
[[0 array([0, 1])]
[1 array([0, 1, 2])]
[2 array([1, 2, 3])]
[3 array([2, 3, 0])]
[0 array([3, 0, 1])]
[1 array([0, 1, 4])]
[4 array([1, 4, 5])]
[5 array([4, 5, 6])]
[6 array([5, 6, 7])]
[7 array([6, 7])]]
indices.dtype object
<ipython-input-88-d9b081bf2f1a>:48: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
indices = np.array([
<ipython-input-88-d9b081bf2f1a>:56: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
co_occurrence_indices = np.array([
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-88-d9b081bf2f1a> in <module>
84 sequence, word_to_id, id_to_word = preprocess(corpus)
85 vocabrary_size = max(word_to_id.values()) + 1
---> 86 create_cooccurrence_matrix(sequence, vocabrary_size , 3)
<ipython-input-88-d9b081bf2f1a> in create_cooccurrence_matrix(sequence, vocabrary_size, context_size)
70 indices.dtype
71 ))
---> 72 co_ccurrence_matrix[
73 indices
74 ] += 1
IndexError: arrays used as indices must be of integer (or boolean) type
import numpy as np
def preprocess(text):
"""
Args:
text: A string including sentences to process. corpus
Returns:
sequence:
A numpy array of word indices to every word in the original text as they appear in the text.
The objective of corpus is to preserve the original text but as numerical indices.
word_to_id: A dictionary to map a word to a word index
id_to_word: A dictionary to map a word index to a word
"""
text = text.lower()
text = text.replace('.', ' .')
words = text.split(' ')
word_to_id = {}
id_to_word = {}
for word in words:
if word not in word_to_id:
new_id = len(word_to_id)
word_to_id[word] = new_id
id_to_word[new_id] = word
sequence= np.array([word_to_id[w] for w in words])
return sequence, word_to_id, id_to_word
def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
"""
Args:
sequence: word index sequence of the original corpus text
vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
context_size: context (N-gram size N) within which to check co-occurrences.
"""
n = sequence_size = len(sequence)
co_ccurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
stride = int((context_size - 1)/2 )
assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
n, stride
)
for position in range(0, n):
co_ccurrence_matrix[
sequence[position], # position to the word
sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurrence words
] += 1
np.fill_diagonal(co_ccurrence_matrix, 0)
return co_ccurrence_matrix
corpus= "To be, or not to be, that is the question"
sequence, word_to_id, id_to_word = preprocess(corpus)
vocabrary_size = max(word_to_id.values()) + 1
create_cooccurrence_matrix(sequence, vocabrary_size , 3)
---
[[0 2 0 1 0 0 0 0]
[2 0 1 0 1 0 0 0]
[0 1 0 1 0 0 0 0]
[1 0 1 0 0 0 0 0]
[0 1 0 0 0 1 0 0]
[0 0 0 0 1 0 1 0]
[0 0 0 0 0 1 0 1]
[0 0 0 0 0 0 1 0]]
Used ptb.train.txt from enter link description here.
Timer unit: 1e-06 s
Total time: 23.0015 s
File: <ipython-input-8-27f5e530d4ff>
Function: create_cooccurrence_matrix at line 1
Line # Hits Time Per Hit % Time Line Contents
==============================================================
1 def create_cooccurrence_matrix(sequence, vocabrary_size, context_size=3):
2 """
3 Args:
4 sequence: word index sequence of the original corpus text
5 vocabrary_size: number of words in the vocabrary (same with co-occurrence vector size)
6 context_size: context (N-gram size N) within to check co-occurrences.
7 Returns:
8 co_occurrence matrix
9 """
10 1 4.0 4.0 0.0 n = sequence_size = len(sequence)
11 1 98.0 98.0 0.0 co_occurrence_matrix = np.zeros((vocabrary_size, vocabrary_size), dtype=np.int32)
12
13 1 5.0 5.0 0.0 stride = int((context_size - 1)/2 )
14 1 1.0 1.0 0.0 assert(n > stride), "sequence_size {} is less than/equal to stride {}".format(
15 n, stride
16 )
17
18 """
19 # Handle position=slice(0 : (stride-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
20 # Handle position=slice((n-1-stride) : (n-1) +1), co-occurrences=slice(max(0, position-stride): min((position+stride),n-1) +1)
21 indices = [*range(0, (stride-1) +1), *range((n-1)-stride +1, (n-1) +1)]
22 #print(indices)
23
24 for position in indices:
25 debug(sequence, position, stride, False)
26 co_occurrence_matrix[
27 sequence[position], # position to the word
28 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # indices to co-occurance words
29 ] += 1
30
31
32 # Handle position=slice(stride, ((sequence_size-1) - stride) +1)
33 for position in range(stride, (sequence_size-1) - stride + 1):
34 co_occurrence_matrix[
35 sequence[position], # position to the word
36 sequence[(position-stride) : (position + stride + 1)] # indices to co-occurance words
37 ] += 1
38 """
39
40 929590 1175326.0 1.3 5.1 for position in range(0, n):
41 2788767 15304643.0 5.5 66.5 co_occurrence_matrix[
42 1859178 2176964.0 1.2 9.5 sequence[position], # position to the word
43 929589 3280181.0 3.5 14.3 sequence[max(0, position-stride) : min((position+stride),n-1) +1] # positions to co-occurance words
44 929589 1062613.0 1.1 4.6 ] += 1
45
46 1 1698.0 1698.0 0.0 np.fill_diagonal(co_occurrence_matrix, 0)
47
48 1 2.0 2.0 0.0 return co_occurrence_matrix
Upvotes: 4
Views: 735
Reputation: 19307
EDIT: You could do this using inbuilt sklearn functions extremely easily, but seeing the history of your questions, I believe you are looking for a pure NumPy vectorized implementation.
IIUC, you want to create a co-occurrence matrix based on the context window around a word. So, if there are 12 words in a vocabulary, 100 sentences, and say a context size of 2, then you want to look at the rolling windows of 5 (2 left, 1 center, 2 right)
size in each of the sentences and iteratively (or vectorized) add the context words to get a (12, 12) matrix which tells you how many times a word occurs in the context window of another word.
You can do this in a completely vectorized manner as such (explanation in last section) -
#Definitions
sentences, vocab, length, context_size = 100, 12, 15, 2
#Create dummy corpus (label encoded)
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length)) #(100, 15)
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window #(100, 11, 5)
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1] #(120, 8, 8)
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]
Let's test this on a single sentence corpus to be or not to be that is the question
sentence = 'to be or not to be that is the question'
corpus = np.array([[0, 1, 2, 3, 0, 1, 4, 5, 6, 7]])
#Definitions
vocab, context_size = 8, 2
window = context_size*2+1
#Create rolling window view of the sequences
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride)
#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size]
context = np.delete(rolling_window, center_idx, -1)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1]))
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 5. 1. 3. 1. 2. 1. 0.]
[5. 0. 3. 2. 2. 1. 2. 1.]
[1. 3. 0. 1. 1. 0. 0. 0.]
[3. 2. 1. 0. 2. 1. 0. 0.]
[1. 2. 1. 2. 0. 1. 1. 1.]
[2. 1. 0. 1. 1. 0. 1. 0.]
[1. 2. 0. 0. 1. 1. 0. 1.]
[0. 1. 0. 0. 1. 0. 1. 0.]]
Let's start with creating some label encoded dummy data. Here there are 100
sentences, with a vocab of 12
size. The length of each sentence is 15
and the window that I am taking is 5 (2+1+2)
-
sentences, vocab, length, context_size = 100, 12, 15, 2
window = context_size*2+1
corpus = np.random.randint(0, vocab, (sentences, length))
corpus[0:2]
#top 2 sentences
array([[ 9, 8, 9, 4, 2, 10, 9, 0, 7, 1, 11, 0, 7, 3, 1],
[ 7, 9, 4, 0, 1, 9, 10, 7, 4, 2, 2, 3, 5, 8, 8]])
Next, we want to create rolling window views of the window size so that we can then get to our next stages. The shape of this new view would be equal to (sentences, number of windows, window size)
and so using stride_tricks
we can create a rolling window view of this matrix quite easily.
#Create shape and stride definitions
shape = corpus.shape[0], corpus.shape[1]-window+1, window
stride = corpus.strides[0], corpus.strides[1], corpus.strides[1]
print(shape, stride)
#create view
rolling_window = np.lib.stride_tricks.as_strided(corpus, shape=shape, strides=stride) #(100, 11, 5)
print('\nView for first sequence ->')
print(rolling_window[0])
(100, 11, 5) (120, 8, 8)
View for first sequence ->
[[ 9 8 9 4 2]
[ 8 9 4 2 10]
[ 9 4 2 10 9]
[ 4 2 10 9 0]
[ 2 10 9 0 7]
[10 9 0 7 1]
[ 9 0 7 1 11]
[ 0 7 1 11 0]
[ 7 1 11 0 7]
[ 1 11 0 7 3]
[11 0 7 3 1]]
Next let's look at only a single sentence first and get that into a co-occurance matrix. After that we can scale it to a higher dimension matrix.
For a SINGLE SENTENCE, we can do the following steps -
np.eye(vocab)
and filtering for the context labels(word, word)
co-occurence matrix from the multi-hot context vectors for each window.position = rolling_window[0][:,2]
context = np.delete(rolling_window[0], 2, 1)
context_multihot = np.sum(np.eye(vocab)[context], axis=1)
cooccurence = context_multihot.T@context_multihot
np.fill_diagonal(cooccurence,0)
print(cooccurence)
[[0. 3. 2. 1. 1. 0. 0. 5. 0. 2. 1. 4.]
[3. 0. 0. 2. 0. 0. 0. 4. 0. 2. 1. 3.]
[2. 0. 0. 0. 2. 0. 0. 1. 2. 3. 2. 0.]
[1. 2. 0. 0. 0. 0. 0. 1. 0. 0. 0. 2.]
[1. 0. 2. 0. 0. 0. 0. 0. 1. 4. 1. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
[5. 4. 1. 1. 0. 0. 0. 0. 0. 1. 2. 2.]
[0. 0. 2. 0. 1. 0. 0. 0. 0. 2. 1. 0.]
[2. 2. 3. 0. 4. 0. 0. 1. 2. 0. 4. 1.]
[1. 1. 2. 0. 1. 0. 0. 2. 1. 4. 0. 0.]
[4. 3. 0. 2. 0. 0. 0. 2. 0. 1. 0. 0.]]
We have now been able to do the whole thing for 1 sentence. Now we just have to scale to 100 sentences without for loops. For this only a few things need to change.
context_multihot
over last 2 axis before dot productnp.dot
to np.tensordot
so that we can reduce specified axis. In this case, we have to perform (100, 12, 11) @ (100, 11, 12) -> (12, 12)
. So select axis accordingly.#Creating co-occurence matrix based on context window
center_idx = context_size
#position = rolling_window[:,:,context_size] #(100, 11)
context = np.delete(rolling_window, center_idx, -1) #(100, 11, 4)
context_multihot = np.sum(np.eye(vocab)[context], axis=-2) #(100, 11, 12)
cooccurence = np.tensordot(context_multihot.transpose(0,2,1), context_multihot, axes=([0,2],[0,1])) #(12, 12)
np.fill_diagonal(cooccurence,0) #(12, 12)
print(cooccurence)
[[ 0. 94. 100. 114. 91. 92. 90. 128. 100. 114. 91. 84.]
[ 94. 0. 78. 96. 90. 65. 76. 68. 76. 108. 58. 68.]
[100. 78. 0. 125. 107. 93. 83. 84. 73. 84. 97. 110.]
[114. 96. 125. 0. 84. 97. 76. 110. 80. 94. 117. 97.]
[ 91. 90. 107. 84. 0. 84. 87. 103. 60. 127. 123. 97.]
[ 92. 65. 93. 97. 84. 0. 67. 87. 72. 87. 74. 92.]
[ 90. 76. 83. 76. 87. 67. 0. 83. 73. 118. 81. 108.]
[128. 68. 84. 110. 103. 87. 83. 0. 72. 100. 115. 69.]
[100. 76. 73. 80. 60. 72. 73. 72. 0. 83. 81. 100.]
[114. 108. 84. 94. 127. 87. 118. 100. 83. 0. 109. 110.]
[ 91. 58. 97. 117. 123. 74. 81. 115. 81. 109. 0. 104.]
[ 84. 68. 110. 97. 97. 92. 108. 69. 100. 110. 104. 0.]]
Upvotes: 1