Mithril

Reputation: 13778

Any way to optimize large input's memory usage in keras?

I am trying to use a 2D CNN to do text classification on Chinese articles and am having trouble with Keras's Convolution2D. I know the basic flow of using Convolution2D on images, but I'm stuck using my own dataset with Keras. This is one of my problems:

Dataset

  1. 9800 Chinese articles.

    Negative articles and non-negative articles [note: a non-negative article may be positive or neutral], so it is just a binary classification problem. I tried a Convolution1D NN, but the results were not good.

  2. Used a tokenizer and word2vec to transform the data into shape (9800, 6810, 200).

    The longest article has 6810 words and the shortest has fewer than 50, so all articles need to be padded to 6810. 200 is the word2vec vector size (some people seem to call it embedding_size?). The format looks like:

     1     [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
     2     [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
     ....
     9999  [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
    

Is the maximum article length of 6810 words too large? I had to reduce the 9800 samples to 6500 to avoid a MemoryError, because 6500 samples already consume all of my 32GB of RAM. Is there any way to optimize memory usage other than trimming all the articles to a shorter length?
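To see why the MemoryError occurs, here is a rough back-of-the-envelope calculation of the dense array's footprint, assuming the vectors are stored as float32 (4 bytes per value):

```python
# Approximate memory footprint of the dense (samples, 6810, 200) array,
# assuming float32 (4 bytes per value).
samples, max_len, embedding_size = 9800, 6810, 200
total_bytes = samples * max_len * embedding_size * 4
print(total_bytes / 1024**3)  # ~49.7 GiB -- well beyond 32 GB of RAM
```

So even at float32 precision the full padded dataset does not fit in 32GB, which is why loading it lazily (as the answer below suggests) or padding less aggressively is necessary.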

Upvotes: 4

Views: 2158

Answers (1)

nemo

Reputation: 57709

The Keras FAQ already answers this question partly. You can load your data in chunks using model.fit_generator(). The generator runs in a separate thread and produces your mini-batches, possibly loading them from your archive one-by-one, avoiding loading everything into RAM at once.

The code for using this would roughly look like this:

def train_generator():
    while True:
        # Load the next chunk from disk (e.g. one file of your archive)
        chunk = read_next_chunk_of_data()
        x, y = extract_training_data_from_chunk(chunk)
        yield (x, y)

# Depending on your Keras version you may also need to pass
# samples_per_epoch (old API) or steps_per_epoch (newer API).
model.fit_generator(generator=train_generator())
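For your specific dataset, a further memory saving is available: instead of padding every article to the global maximum of 6810 words up front, the generator can pad each mini-batch only to the longest article *in that batch*. A minimal sketch, assuming the articles are kept as variable-length float32 arrays of word2vec vectors (the names `batch_generator`, `articles`, and `labels` are illustrative, not from any Keras API):

```python
import numpy as np

def batch_generator(articles, labels, batch_size=32, embedding_size=200):
    """Yield (x, y) mini-batches, padding each batch only to its own
    longest article instead of the global maximum length."""
    n = len(articles)
    while True:
        for start in range(0, n, batch_size):
            batch = articles[start:start + batch_size]
            y = labels[start:start + batch_size]
            max_len = max(len(a) for a in batch)
            # Zero-padded batch tensor: (batch, max_len_in_batch, emb)
            x = np.zeros((len(batch), max_len, embedding_size),
                         dtype=np.float32)
            for i, a in enumerate(batch):
                x[i, :len(a), :] = a
            yield x, np.asarray(y)
```

Note that this only works if the model accepts variable-length input (e.g. a fully convolutional architecture with global pooling); if the network has a fixed input shape, you would still pad to 6810 but do it one batch at a time inside the generator, which keeps only one batch in RAM at any moment.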

Upvotes: 5
