Reputation: 357
I'm writing bag-of-words (BOW) code for a dataset of 30000 rows. I have X_train, which is (21000, 2). Those 2 columns are: title and description. So, I have the following code:
import numpy as np

def text_to_bow(text: str) -> np.array:
    text = text.split()
    res = np.zeros(len(bow_vocabulary))  # bow_vocabulary includes the 10000 most popular tokens
    for word in text:
        for i in range(len(bow_vocabulary)):
            if word == bow_vocabulary[i]:
                res[i] += 1
    return res
def items_to_bow(items: np.array) -> np.array:
    desc_index = 1
    res = np.empty((0, k), dtype='uint8')  # k is the vocabulary size (10000)
    for i in range(len(items)):
        description = items[i][desc_index]
        temp = text_to_bow(description)
        res = np.append(res, [temp], axis=0)
    return np.array(res)
My code seems to be correct, as it passes the asserts that come with my task.
So, when I run:
X_train_bow = items_to_bow(X_train)
I get the error:
MemoryError: Unable to allocate 12.1 MiB for an array with shape (158, 10000) and data type float64
I've already set overcommit_memory to 1 on Ubuntu, but it hasn't helped. I don't want to switch to 64-bit Python either, as there may be problems with some modules.
I've also tried another version of the function (accumulating into a regular Python list):
def items_to_bow(items: np.array) -> np.array:
    desc_index = 1
    res = []
    for i in range(len(items)):
        description = items[i][desc_index]
        temp = text_to_bow(description)
        res.append(temp)
        if len(res) % 1000 == 0:  # print progress every 1000 items
            print(len(res))
    return np.array(res)
But it has already been running for an hour or so without finishing, which is not practical.
Are there any ways to solve the problem? Would be grateful for any possible help.
Upvotes: 1
Views: 102
Reputation: 534
Use chunking. In pandas, the chunksize parameter of read_csv is made for this: read a chunk of the data, process it, append your output to a file, make sure the chunk is freed, then repeat.
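Here is a minimal sketch of that loop, assuming your training data sits in a CSV file (called train.csv here, with the text in a description column) and reusing bow_vocabulary from your code; the file name, column name, and chunk size are illustrative, not fixed:

import numpy as np
import pandas as pd

# Map each vocabulary token to its column index once, up front
word_index = {word: i for i, word in enumerate(bow_vocabulary)}

def chunk_to_bow(chunk: pd.DataFrame) -> np.ndarray:
    # One uint8 row of token counts per item in the chunk
    res = np.zeros((len(chunk), len(bow_vocabulary)), dtype='uint8')
    for row, text in enumerate(chunk['description']):
        for word in str(text).split():
            i = word_index.get(word)
            if i is not None:
                res[row, i] += 1
    return res

with open('X_train_bow.bin', 'wb') as out:
    for chunk in pd.read_csv('train.csv', chunksize=1000):
        chunk_to_bow(chunk).tofile(out)  # append raw uint8 bytes to disk
        del chunk                        # drop the reference so the chunk can be freed

# Later, load the result back and restore the shape:
# X_train_bow = np.fromfile('X_train_bow.bin', dtype='uint8').reshape(-1, len(bow_vocabulary))

As an aside, the word_index dict replaces your inner scan over bow_vocabulary with an O(1) lookup per word, which should also cut the hour-long runtime down considerably.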
Upvotes: 1