Reputation: 2268
I'm trying to implement word2vec with skip-gram from scratch and got stuck on creating the input layer.
class SkipGramBatcher:
    def __init__(self, text):
        self.text = text.results

    def get_batches(self, batch_size):
        n_batches = len(self.text) // batch_size
        pairs = []
        for idx in range(0, len(self.text)):
            window_size = 5
            idx_neighbors = self._get_neighbors(self.text, idx, window_size)
            idx_pairs = [(idx, idx_neighbor) for idx_neighbor in idx_neighbors]
            pairs.extend(idx_pairs)
        for idx in range(0, len(pairs), batch_size):
            X = [pair[0] for pair in pairs[idx:idx + batch_size]]
            Y = [pair[1] for pair in pairs[idx:idx + batch_size]]
            yield X, Y

    def _get_neighbors(self, text, idx, window_size):
        text_length = len(text)
        start = max(idx - window_size, 0)
        end = min(idx + window_size + 1, text_length)
        neighbors_words = set(text[start:end])
        return list(neighbors_words)
For testing purposes I've limited my vocab_size to 1000 words.
When I try to test my SkipGramBatcher, I run out of free RAM and my Colab runtime restarts.
for x, y in skip_gram_batcher.get_batches(64):
    x_ohe = to_one_hot(x)
    y_ohe = to_one_hot(y)
    print(x_ohe.shape, y_ohe.shape)
import numpy as np

def to_one_hot(indexes):
    n_values = np.max(indexes) + 1
    return np.eye(n_values)[indexes]
I guess I'm doing something the wrong way; any help is appreciated.
The Google Colab message:
Mar 5, 2019, 4:47:33 PM WARNING WARNING:root:kernel fee9eac6-2adf-4c31-9187-77e8018e2eae restarted
Mar 5, 2019, 4:47:33 PM INFO KernelRestarter: restarting kernel (1/5), keep random ports
Mar 5, 2019, 4:47:23 PM WARNING tcmalloc: large alloc 66653388800 bytes == 0x27b4c000 @ 0x7f4533736001 0x7f4527e29b85 0x7f4527e8cb43 0x7f4527e8ea86 0x7f4527f26868 0x5030d5 0x507641 0x504c28 0x502540 0x502f3d 0x506859 0x502209 0x502f3d 0x506859 0x504c28 0x511eca 0x502d6f 0x506859 0x504c28 0x502540 0x502f3d 0x506859 0x504c28 0x502540 0x502f3d 0x507641 0x504c28 0x501b2e 0x591461 0x59ebbe 0x507c17
Mar 5, 2019, 4:39:43 PM INFO Adapting to protocol v5.1 for kernel fee9eac6-2adf-4c31-9187-77e8018e2eae
Upvotes: 1
Views: 1138
Reputation: 969
I think I've figured out why Google Colab allocates a whopping 66 GB to your program.
X holds one batch worth of pair[0] values:

X = [pair[0] for pair in pairs[idx:idx + batch_size]]

and the one-hot conversion sizes its identity matrix from the largest index in that batch:

n_values = np.max(indexes) + 1
return np.eye(n_values)[indexes]
For the first batch this looks harmless: X holds positions 0 through 63, so n_values is 64 and np.eye(64)[X] returns a (64, 64) matrix. But pair[0] is the word's position in the corpus, not a vocabulary index, so np.max(X) grows with every batch; near the end of the corpus n_values approaches len(self.text). Since np.eye(n_values) materializes the full n_values x n_values identity matrix before indexing, a corpus of roughly 91,000 tokens gives 91,000 x 91,000 float64s, about 66 GB, which matches the tcmalloc allocation of 66,653,388,800 bytes in your log. On top of that, the pairs list holds a (position, neighbor) pair for every word in the whole text, so memory climbs even before the one-hot step.
Warning:- This is only for X; consider Y too.
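A quick back-of-the-envelope check (the ~91,000 figure is inferred from the 66 GB allocation in your log, not from anything in the code):

n = 91_000                            # approximate corpus length, inferred from the log
bytes_needed = n * n * 8              # np.eye defaults to float64: 8 bytes per entry
print(f"{bytes_needed / 1e9:.1f} GB") # ~66.2 GB, the scale of the tcmalloc alloc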
Hint:- Y is a list of strings, so np.max(Y) just returns the lexicographically largest word and np.max(Y) + 1 raises a TypeError; Y needs to be mapped to integer vocabulary indexes before it can be one-hot encoded at all.
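One way out (a minimal sketch, not a drop-in patch for your class): build a word-to-index vocabulary once and one-hot encode against the fixed vocab_size, storing vocabulary indexes on both sides of each pair. The tokens and word_to_idx names below are illustrative, not from your code:

import numpy as np

def to_one_hot(indexes, n_values):
    # size the encoding by the vocabulary, not by np.max(indexes) + 1
    return np.eye(n_values)[indexes]

# stand-in for the tokenized corpus from the question
tokens = "the quick brown fox jumps over the lazy dog".split()
word_to_idx = {w: i for i, w in enumerate(sorted(set(tokens)))}
vocab_size = len(word_to_idx)

# store vocabulary indexes on BOTH sides of each pair, so neither X nor Y
# ever contains corpus positions or raw strings
window_size = 2
pairs = []
for idx, word in enumerate(tokens):
    start, end = max(idx - window_size, 0), min(idx + window_size + 1, len(tokens))
    for neighbor in tokens[start:idx] + tokens[idx + 1:end]:
        pairs.append((word_to_idx[word], word_to_idx[neighbor]))

X = [p[0] for p in pairs[:64]]
Y = [p[1] for p in pairs[:64]]
print(to_one_hot(X, vocab_size).shape)  # (len(X), vocab_size), bounded by the vocabulary
print(to_one_hot(Y, vocab_size).shape)

This keeps every encoded batch at (batch_size, vocab_size), so with your vocab_size of 1000 a batch of 64 is 64 x 1000 float64s, about half a megabyte instead of tens of gigabytes.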
Upvotes: 2