Vishal Panjwani

Reputation: 23

Memory error while creating a large one-hot encoding for an LSTM

I am trying to build a character-level LSTM model using Keras, and for that I need to one-hot encode the characters that are fed to the model. Each line has around 1,000 characters, and there are around 160,000 lines.

I tried to create a NumPy array of zeros and set the corresponding entries to 1, but I am getting a memory error because the matrix is so large. Is there any other way to do this?

Upvotes: 1

Views: 1812

Answers (2)

Uzay Macar

Reputation: 264

Perhaps an easier and more intuitive solution is to add a custom one-hot encoding layer in your Keras model architecture.

def build_model(self, batch_size, print_summary=False):
    # The model takes integer character indices, not one-hot vectors.
    X = Input(shape=(self.sequence_length,), batch_size=batch_size)
    # The one-hot expansion happens inside the model, one batch at a time.
    embedding = OneHotEncoding(num_classes=self.vocab_size+1,
                               sequence_length=self.sequence_length)(X)
    encoder = Bidirectional(CuDNNLSTM(units=self.recurrent_units,
                                      return_sequences=True))(embedding)
    ...

where we can define the OneHotEncoding layer as follows:

from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # base class for custom layers

class OneHotEncoding(Layer):
    """Expands integer character indices into one-hot vectors inside the model."""

    def __init__(self, num_classes=None, sequence_length=None, **kwargs):
        if num_classes is None or sequence_length is None:
            raise ValueError("Can't leave params @num_classes or @sequence_length empty")
        super(OneHotEncoding, self).__init__(**kwargs)
        self.num_classes = num_classes
        self.sequence_length = sequence_length

    def call(self, inputs):
        # Input tensors default to float32, but one_hot needs integer indices.
        # Shape: (batch, sequence_length) -> (batch, sequence_length, num_classes)
        return K.one_hot(indices=K.cast(inputs, "int32"),
                         num_classes=self.num_classes)

Here we rely on the fact that the standard fit function feeds the model its training samples one batch at a time. The model receives integer indices and expands them into one-hot vectors per batch, so the full one-hot array is never held in memory and no MemoryError is raised.
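To see roughly why the per-batch expansion helps, here is a back-of-the-envelope sketch of the memory involved. The line count and line length come from the question; the vocabulary size of 100, the batch size of 32, the 4-byte values, and the helper names are assumptions made only for illustration.

```python
def one_hot_bytes(num_lines, line_length, vocab_size, bytes_per_value=4):
    """Bytes needed for a dense one-hot tensor of shape (lines, length, vocab)."""
    return num_lines * line_length * vocab_size * bytes_per_value


def index_bytes(num_lines, line_length, bytes_per_index=4):
    """Bytes needed to store one integer index per character."""
    return num_lines * line_length * bytes_per_index


NUM_LINES = 160_000    # from the question
LINE_LENGTH = 1_000    # from the question
VOCAB_SIZE = 100       # assumption: the question does not state the vocabulary size
BATCH_SIZE = 32        # assumption: any small batch size gives the same picture

full_one_hot = one_hot_bytes(NUM_LINES, LINE_LENGTH, VOCAB_SIZE)
full_indices = index_bytes(NUM_LINES, LINE_LENGTH)
batch_one_hot = one_hot_bytes(BATCH_SIZE, LINE_LENGTH, VOCAB_SIZE)

print(f"full one-hot tensor: {full_one_hot / 1e9:.1f} GB")
print(f"integer indices:     {full_indices / 1e9:.2f} GB")
print(f"one one-hot batch:   {batch_one_hot / 1e6:.1f} MB")
```

Under these assumptions the dense one-hot tensor is vocab_size times larger than the index array, while a single one-hot batch stays small.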

Upvotes: 0

scnerd

Reputation: 6103

Sure:

  1. Create batches. Only process, say, 10,000 entries (characters) at a time, computing them and feeding them into your neural network just before they're needed (for example, by using a generator instead of a list). Keras has a fit_generator training function for this; in recent TensorFlow releases fit_generator is deprecated and fit accepts generators directly.

  2. Group chunks of data together. Instead of representing a line as a matrix of the one-hot encodings of its characters, take the sum or max over those rows to produce a single vector for the line. Each line is then only one vector, with dimensionality equal to the number of unique characters in your data set. E.g., instead of [[0, 0, 1], [0, 1, 0], [0, 0, 1]], use [0, 1, 1] to represent the entire line.
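Both ideas can be sketched in a few lines. This is a dependency-free illustration rather than a definitive implementation: the names one_hot_batches, bag_of_characters and char_to_index are hypothetical, the toy data is made up, and lines are assumed to be padded to equal length already. In practice each batch would be converted to a NumPy array before being passed to Keras.

```python
def one_hot_batches(lines, char_to_index, batch_size):
    """Yield one-hot batches lazily so only one batch exists in memory at a time."""
    vocab_size = len(char_to_index)
    for start in range(0, len(lines), batch_size):
        batch = []
        for line in lines[start:start + batch_size]:
            rows = []
            for ch in line:
                row = [0] * vocab_size
                row[char_to_index[ch]] = 1  # mark the character's position
                rows.append(row)
            batch.append(rows)
        yield batch


def bag_of_characters(line, char_to_index):
    """Collapse a line into one vector: the max over its one-hot rows."""
    vector = [0] * len(char_to_index)
    for ch in line:
        vector[char_to_index[ch]] = 1
    return vector


# Toy data (assumption): three lines over a three-character vocabulary.
lines = ["cbc", "abc", "aaa"]
char_to_index = {"a": 0, "b": 1, "c": 2}

batches = list(one_hot_batches(lines, char_to_index, batch_size=2))
first_line = batches[0][0]                        # one-hot matrix for "cbc"
summary = bag_of_characters("cbc", char_to_index)  # single vector for "cbc"
```

Here first_line is the matrix from the example above and summary is its collapsed form. Note that the collapsed vector discards character order, so it suits models that do not need the sequence itself.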

Upvotes: 1
