Reputation: 357
I'm writing bag-of-words (BOW) code for a dataset of 30000 rows. I have X_train, which is (21000, 2). Those 2 columns are: title and description. So, I have the following code:
import numpy as np

def text_to_bow(text: str) -> np.array:
    text = text.split()
    res = np.zeros(len(bow_vocabulary))  # bow_vocabulary includes the 10000 most popular tokens
    for word in text:
        for i in range(len(bow_vocabulary)):
            if word == bow_vocabulary[i]:
                res[i] += 1
    return res
def items_to_bow(items: np.array) -> np.array:
    desc_index = 1
    res = np.empty((0, k), dtype='uint8')  # k is the vocabulary size (10000)
    for i in range(len(items)):
        description = items[i][desc_index]
        temp = text_to_bow(description)
        res = np.append(res, [temp], axis=0)
    return np.array(res)
My code seems to be correct, as it passes the asserts that come with my task.
So, when I run:
X_train_bow = items_to_bow(X_train)
I get the error:
MemoryError: Unable to allocate 12.1 MiB for an array with shape (158, 10000) and data type float64
I've already set overcommit_memory to 1 on Ubuntu, but it hasn't helped. I don't want to switch to 64-bit Python either, as there may be problems with some modules.
I've also tried another version of the function (accumulating into a regular Python list):
def items_to_bow(items: np.array) -> np.array:
    desc_index = 1
    res = []
    for i in range(len(items)):
        description = items[i][desc_index]
        temp = text_to_bow(description)
        res.append(temp)
        if len(res) % 1000 == 0:  # print progress every 1000 items
            print(len(res))
    return np.array(res)
But it has already been running for an hour or so without finishing, which is not practical.
Are there any ways to solve the problem? Would be grateful for any possible help.
Upvotes: 1
Views: 102
Reputation: 534
Use chunking. In pandas, the chunksize parameter of read_csv is made for this: read a chunk of the data, process it, append your output to a file, make sure the chunk is freed, then repeat.
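Here is a minimal sketch of that loop, assuming your training data sits in a CSV file (called train.csv here, with the text in a description column) and reusing bow_vocabulary from your code; the file name, column name, and chunk size are illustrative, not fixed:

import numpy as np
import pandas as pd

# Map each vocabulary token to its column index once, up front
word_index = {word: i for i, word in enumerate(bow_vocabulary)}

def chunk_to_bow(chunk: pd.DataFrame) -> np.ndarray:
    # One uint8 row of token counts per item in the chunk
    res = np.zeros((len(chunk), len(bow_vocabulary)), dtype='uint8')
    for row, text in enumerate(chunk['description']):
        for word in str(text).split():
            i = word_index.get(word)
            if i is not None:
                res[row, i] += 1
    return res

with open('X_train_bow.bin', 'wb') as out:
    for chunk in pd.read_csv('train.csv', chunksize=1000):
        chunk_to_bow(chunk).tofile(out)  # append raw uint8 bytes to disk
        del chunk                        # drop the reference so the chunk can be freed

# Later, load the result back and restore the shape:
# X_train_bow = np.fromfile('X_train_bow.bin', dtype='uint8').reshape(-1, len(bow_vocabulary))

As an aside, the word_index dict replaces your inner scan over bow_vocabulary with an O(1) lookup per word, which should also cut the hour-long runtime down considerably.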
Upvotes: 1