Reputation: 23
I am trying to build a character-level LSTM model using Keras, and for that I need to create one-hot encodings of the characters to feed into the model. I have around 1000 characters in each line, with around 160,000 lines.
I tried to create a NumPy array of zeros and set the corresponding entries to 1, but I am getting a MemoryError due to the large size of the matrix. Is there any other way to do this?
Upvotes: 1
Views: 1812
Reputation: 264
Perhaps an easier and more intuitive solution is to add a custom one-hot encoding layer in your Keras model architecture.
def build_model(self, batch_size, print_summary=False):
    X = Input(shape=(self.sequence_length,), batch_size=batch_size)
    embedding = OneHotEncoding(num_classes=self.vocab_size+1,
                               sequence_length=self.sequence_length)(X)
    encoder = Bidirectional(CuDNNLSTM(units=self.recurrent_units,
                                      return_sequences=True))(embedding)
    ...
where we can define the OneHotEncoding layer as follows:
from tensorflow.keras import backend as K
from tensorflow.keras.layers import Layer  # for creating custom layers

class OneHotEncoding(Layer):
    def __init__(self, num_classes=None, sequence_length=None):
        if num_classes is None or sequence_length is None:
            raise ValueError("Can't leave params @num_classes or @sequence_length empty")
        super(OneHotEncoding, self).__init__()
        self.num_classes = num_classes
        self.sequence_length = sequence_length

    def encode(self, inputs):
        # one_hot needs integer indices, but Input tensors default to float32
        return K.one_hot(indices=K.cast(inputs, "int32"),
                         num_classes=self.num_classes)

    def call(self, inputs):
        # (batch, sequence_length) indices -> (batch, sequence_length, num_classes)
        return self.encode(inputs)
This relies on the fact that Keras feeds the model its training samples one batch at a time (with the standard fit function), so the one-hot tensor only ever exists for the current batch. The full dataset can stay as integer character indices, which avoids the MemoryError.
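To make the memory argument concrete, here is a rough back-of-the-envelope sketch. The line count and line length come from the question; the vocabulary size (100) and batch size (64) are assumptions picked purely for illustration, as are the helper names.

```python
def dense_one_hot_bytes(n_lines, seq_len, vocab_size, bytes_per_value):
    """Bytes needed for a one-hot array of shape (n_lines, seq_len, vocab_size)."""
    return n_lines * seq_len * vocab_size * bytes_per_value


def index_bytes(n_lines, seq_len, bytes_per_index):
    """Bytes needed to store one integer index per character."""
    return n_lines * seq_len * bytes_per_index


N_LINES, SEQ_LEN = 160_000, 1000  # figures from the question
VOCAB_SIZE = 100                  # assumed; not stated in the question
BATCH_SIZE = 64                   # assumed

# Whole dataset one-hot encoded up front (float64 is the np.zeros default).
full = dense_one_hot_bytes(N_LINES, SEQ_LEN, VOCAB_SIZE, 8)
# Whole dataset kept as int32 character indices.
indices = index_bytes(N_LINES, SEQ_LEN, 4)
# A single batch one-hot encoded on the fly as float32.
batch = dense_one_hot_bytes(BATCH_SIZE, SEQ_LEN, VOCAB_SIZE, 4)

print(f"full one-hot array: {full / 1e9:.1f} GB")     # 128.0 GB
print(f"integer indices:    {indices / 1e6:.1f} MB")  # 640.0 MB
print(f"one one-hot batch:  {batch / 1e6:.1f} MB")    # 25.6 MB
```

Under these assumptions, the indices plus a single encoded batch fit comfortably in memory, while the fully materialized array does not.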
Upvotes: 0
Reputation: 6103
Sure:
Create batches. Only process, say, 10,000 entries (characters) at a time, computing and feeding them into your neural network just before they're needed (say, by using a generator instead of a list). Keras has a fit_generator training function to do this.
Group chunks of data together. Say, instead of representing a line as a matrix of the one-hot encodings of its characters, use the sum/max over all of those one-hot vectors to produce a single vector for the line. Now, each line is only a single vector, with dimensionality equal to the number of unique characters in your data set. E.g., instead of [[0, 0, 1], [0, 1, 0], [0, 0, 1]], use [0, 1, 1] (the max; the sum would be [0, 1, 2]) to represent the entire line.
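Both suggestions could be sketched roughly as follows with NumPy, assuming the lines have already been converted to integer character indices. The names one_hot_batches and line_to_multi_hot are made up for this example. A generator used for actual training would also need to yield the targets alongside the inputs, and in recent TensorFlow releases fit accepts generators directly, with fit_generator deprecated.

```python
import numpy as np


def one_hot_batches(encoded_lines, vocab_size, batch_size):
    """Yield one-hot arrays of shape (batch, seq_len, vocab_size), one batch at a time.

    encoded_lines: 2-D integer array of character indices, shape (n_lines, seq_len).
    """
    # Row i of the identity matrix is the one-hot vector for index i.
    eye = np.eye(vocab_size, dtype=np.float32)
    for start in range(0, len(encoded_lines), batch_size):
        batch = encoded_lines[start:start + batch_size]
        # Integer-array indexing replaces every index with its one-hot row.
        yield eye[batch]


def line_to_multi_hot(encoded_line, vocab_size):
    """Collapse a line into one vector marking which characters occur in it.

    This is the max over the line's one-hot vectors; character order is lost.
    """
    vec = np.zeros(vocab_size, dtype=np.float32)
    vec[np.asarray(encoded_line)] = 1.0
    return vec


# Toy data: 2 lines of 3 characters each, vocabulary of 3 characters.
lines = np.array([[2, 1, 2], [0, 0, 1]])
first_batch = next(one_hot_batches(lines, vocab_size=3, batch_size=1))
multi_hot = line_to_multi_hot(lines[0], vocab_size=3)
```

Only one batch of one-hot data exists at any moment with the generator, whereas the multi-hot variant shrinks each line to a single vocab_size-length vector.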
Upvotes: 1