kiddo

Reputation: 135

Keras: Feed pre-trained embeddings as input instead of loading weights in Embedding layer

I'm using the Keras library for sequence labeling. I'm already using pre-trained embeddings for my experiments, following a methodology like this one: https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html
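
For reference, here is a minimal sketch of how such a pre-trained weights matrix is typically built, following the linked blog post; embeddings_path and word_index below are illustrative names, not part of my code:

    import numpy as np

    # read a GloVe-style text file: one "word v1 v2 ... vN" line per word
    embeddings_index = {}
    with open(embeddings_path) as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

    embeddings_size = 200
    # row 0 is left as all zeros and reserved for the padding index
    weights = np.zeros((len(word_index) + 1, embeddings_size))
    for word, i in word_index.items():
        vector = embeddings_index.get(word)
        if vector is not None:
            weights[i] = vector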

MY CODE (EMBEDDINGS SAVED INTERNALLY):

    # required imports (not shown in the original snippet):
    #   from keras.models import Sequential
    #   from keras.layers import Embedding, Dropout, Bidirectional, LSTM, TimeDistributed, Dense
    #   from keras.optimizers import Adam
    # note: output_dim / lr follow the Keras 1.x argument names (units / learning_rate in Keras 2)
    self._model = Sequential(name='core_sequential')
    # the whole embedding matrix is stored inside the model and frozen (trainable=False)
    self._model.add(Embedding(input_dim=weights.shape[0],
                              output_dim=weights.shape[1],
                              weights=[weights],
                              name="embeddings_layer", trainable=False))
    self._model.add(Dropout(dropout_rate, name='dropout_layer_1'))
    self._model.add(Bidirectional(LSTM(output_dim=300,
                                       return_sequences=distributed,
                                       activation="tanh",
                                       name="lstm_layer"), name='birnn_layer'))
    self._model.add(Dropout(dropout_rate, name='dropout_layer_2'))
    self._model.add(TimeDistributed(Dense(output_dim=1,
                                          activation='sigmoid',
                                          name='dense_layer'),
                    name="timesteps_layer"))
    self._model.compile(optimizer=Adam(lr=lr),
                        loss='binary_crossentropy',
                        metrics=['accuracy'])

This works perfectly fine; we just have to feed an nd-array of shape (X, max_sequence_size), i.e. X padded sequences of max_sequence_size time-steps (word indices).
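
For completeness, a minimal sketch of how that index array can be built with Keras' pad_sequences utility (tokenizer and texts below are illustrative names):

    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences

    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)                    # texts: list of raw sentences
    sequences = tokenizer.texts_to_sequences(texts)  # lists of word indices
    X = pad_sequences(sequences, maxlen=max_sequence_size,
                      padding='post', truncating='post')
    # X.shape == (len(texts), max_sequence_size)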

Saving the pre-trained embeddings inside the model blows up the model's size (450 MB per model). If someone wants to use this architecture for multiple models on their own system, say 20 of them, they need approx. 10 GB to save all the models! The bottleneck here is that each model internally stores the word-embedding weights, even though they are always the same.

Trying to find an efficient way to decrease the model's size, I thought it would be better to feed the actual feature vectors (embeddings) externally, which means feeding an nd-array of shape (X, max_sequence_size, embeddings_size): X padded sequences of max_sequence_size time-steps, where each time-step is the actual embedding vector.
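
One simple way to build that 3-D array (a sketch, assuming the padding index 0 maps to the all-zeros row of the embedding matrix) is a NumPy lookup of the word indices in the externally kept weights matrix:

    import numpy as np

    # weights: (vocab_size, embeddings_size) embedding matrix kept outside the model
    # X:       (num_samples, max_sequence_size) array of padded word indices
    X_embedded = weights[X].astype(np.float32)
    # X_embedded.shape == (num_samples, max_sequence_size, embeddings_size)
    # padded positions (index 0) become all-zero vectors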

I can't find any discussion of this issue. In the Keras documentation, the Embedding layer seems to be the only option presented for feeding word vectors into RNNs, and the Keras community seems to underestimate this memory issue. I tried to figure out a solution myself; see my answer below.

Upvotes: 1

Views: 1199

Answers (1)

kiddo

Reputation: 135

SOLUTION (EMBEDDINGS LOADED EXTERNALLY):

    self._model = Sequential(name='core_sequential')
    # the model now expects pre-computed embedding vectors of shape (timesteps, 200)
    self._model.add(InputLayer((None, 200)))
    self._model.add(Dropout(dropout_rate,name='dropout_layer_1'))
    self._model.add(Bidirectional(LSTM(output_dim=300,
                                       return_sequences=distributed,
                                       activation="tanh",
                                       name="lstm_layer"),name='birnn_layer'))
    self._model.add(Dropout(dropout_rate,name='dropout_layer_2'))
    self._model.add(TimeDistributed(Dense(output_dim=1,
                                          activation='sigmoid',
                                          name='dense_layer'), 
                    name="timesteps_layer"))
    self._model.compile(optimizer=Adam(lr=lr),
                        loss='binary_crossentropy', 
                        metrics=['accuracy'])

The above code works, but consider the following:

  • You will possibly need to fine-tune from scratch!
  • You have to be extra careful with the max_length of your sequences: a few outliers (very long sequences) can be a problem, since every sequence is now padded with full embedding vectors up to that length (see the rough estimate below).
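
As a rough, purely illustrative estimate: 10,000 sequences padded to 1,000 time-steps of 200-dimensional float32 vectors occupy about 10,000 × 1,000 × 200 × 4 bytes ≈ 8 GB of memory, whereas the same data fed as word indices (the Embedding-layer approach) is only about 10,000 × 1,000 × 4 bytes ≈ 40 MB.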

I suggest the following solution.

MUCH BETTER SOLUTION (EMBEDDINGS LOADED EXTERNALLY + MASKING):

    self._model = Sequential(name='core_sequential')
    # time-steps whose 200-dim feature vector is all zeros (i.e. the padding) are skipped by downstream layers
    self._model.add(Masking(mask_value=0., input_shape=(None, 200)))
    self._model.add(Dropout(dropout_rate,name='dropout_layer_1'))
    self._model.add(Bidirectional(LSTM(output_dim=300,
                                       return_sequences=distributed,
                                       activation="tanh",
                                       name="lstm_layer"),name='birnn_layer'))
    self._model.add(Dropout(dropout_rate,name='dropout_layer_2'))
    self._model.add(TimeDistributed(Dense(output_dim=1,
                                          activation='sigmoid',
                                          name='dense_layer'), 
                    name="timesteps_layer"))
    self._model.compile(optimizer=Adam(lr=lr),
                        loss='binary_crossentropy', 
                        metrics=['accuracy'])
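
As a usage sketch (X_embedded built as described in the question, with all-zero vectors for padding, and y being hypothetical per-time-step labels when return_sequences=True):

    # X_embedded: (num_samples, max_sequence_size, 200) float32, zero vectors for padding
    # y:          (num_samples, max_sequence_size, 1) binary labels
    self._model.fit(X_embedded, y, batch_size=32, epochs=10)

This way the ~450 MB embedding matrix is stored once on disk and shared by all models, while each saved model contains only the recurrent and dense weights.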

Feel free to comment and criticize; you are more than welcome to!

Upvotes: 0
