Reputation: 13778
I am trying to use a 2D CNN to do text classification on Chinese articles and am having some trouble with Keras' Convolution2D. I know the basic flow of Convolution2D for images, but I am stuck using my own dataset with Keras. This is one of my problems:
I have 9800 Chinese articles, each labeled either negative or non-negative (note: non-negative may mean positive or neutral), so it is just a binary classification problem. I already tested a Convolution1D network, but the result is not good.
I use a tokenizer and word2vec to transform the articles into an array of shape (9800, 6810, 200). The longest article has 6810 words and the shortest fewer than 50, so every article needs to be padded to length 6810; 200 is the word2vec vector size (some people seem to call it embedding_size?). The format looks like:
1 [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
2 [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
....
9800 [[word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200], [word2vec size=200]]
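For reference, this is roughly how I build that array (a minimal sketch; w2v and articles are placeholder names for my trained gensim word2vec model and my tokenized articles):

import numpy as np

MAX_LEN = 6810     # longest article, in words
EMBED_SIZE = 200   # word2vec vector size

def articles_to_tensor(articles, w2v):
    # Zero-pad (or truncate) each tokenized article to MAX_LEN word vectors.
    data = np.zeros((len(articles), MAX_LEN, EMBED_SIZE), dtype=np.float32)
    for i, tokens in enumerate(articles):
        for j, token in enumerate(tokens[:MAX_LEN]):
            if token in w2v:        # skip out-of-vocabulary words
                data[i, j] = w2v[token]
    return data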
Is the maximum article length of 6810 words too large? I have to reduce the 9800 samples to 6500 to avoid a MemoryError, because 6500 samples already eat all of my 32 GB of RAM (a float32 array of shape (9800, 6810, 200) alone needs roughly 53 GB). Is there any way to optimize memory usage other than trimming all articles to a shorter length?
Upvotes: 4
Views: 2158
Reputation: 57709
The Keras FAQ already partly answers this question. You can load your data in chunks using model.fit_generator(). The generator runs in a separate thread and produces your mini-batches, possibly loading them from your archive one by one, so everything never has to sit in RAM at once.
The code for using this would roughly look like this:
def train_generator():
    while True:
        chunk = read_next_chunk_of_data()                # e.g. load one file from disk
        x, y = extract_training_data_from_chunk(chunk)
        yield (x, y)

# steps_per_epoch = the number of batches that make up one epoch
# (older Keras versions instead take samples_per_epoch, counted in samples)
model.fit_generator(generator=train_generator(),
                    steps_per_epoch=steps_per_epoch)
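Applied to the data from your question, the generator can build each mini-batch array on the fly, so the full (9800, 6810, 200) array never has to exist in memory at once. A rough sketch of that idea, where vectorize_batch() is a hypothetical helper that turns a batch of tokenized articles into a (batch_size, 6810, 200) float32 array (e.g. by looking up word2vec vectors and zero-padding):

import numpy as np

BATCH_SIZE = 32

def train_generator(articles, labels):
    # Loop forever; Keras stops after steps_per_epoch batches per epoch.
    n = len(articles)
    while True:
        for start in range(0, n, BATCH_SIZE):
            batch = articles[start:start + BATCH_SIZE]
            x = vectorize_batch(batch)  # hypothetical: (len(batch), 6810, 200)
            y = np.asarray(labels[start:start + BATCH_SIZE])
            yield (x, y)

model.fit_generator(generator=train_generator(articles, labels),
                    steps_per_epoch=int(np.ceil(len(articles) / BATCH_SIZE)))

This way only one batch of word vectors is materialized at a time, at the cost of recomputing the lookups each epoch.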
Upvotes: 5