Reputation: 85
I've been trying to implement this BiLSTM in Keras: https://github.com/ffancellu/NegNN
Here is where I'm at, and it kind of works:
import keras
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, CuDNNLSTM, Dropout, Dense
from keras.callbacks import ModelCheckpoint, EarlyStopping

inputs_w = Input(shape=(sequence_length,), dtype='int32')
inputs_pos = Input(shape=(sequence_length,), dtype='int32')
inputs_cue = Input(shape=(sequence_length,), dtype='int32')
w_emb = Embedding(vocabulary_size+1, embedding_dim, input_length=sequence_length, trainable=False)(inputs_w)
p_emb = Embedding(tag_voc_size+1, embedding_dim, input_length=sequence_length, trainable=False)(inputs_pos)
c_emb = Embedding(2, embedding_dim, input_length=sequence_length, trainable=False)(inputs_cue)
summed = keras.layers.add([w_emb, p_emb, c_emb])
BiLSTM = Bidirectional(CuDNNLSTM(hidden_dims, return_sequences=True))(summed)
DPT = Dropout(0.2)(BiLSTM)
outputs = Dense(2, activation='softmax')(DPT)
checkpoint = ModelCheckpoint('bilstm_one_hot.hdf5', monitor='val_loss', verbose=1, save_best_only=True, mode='auto')
early = EarlyStopping(monitor='val_loss', min_delta=0.0001, patience=5, verbose=1, mode='auto')
model = Model(inputs=[inputs_w, inputs_pos, inputs_cue], outputs=outputs)
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()
model.fit([X_train, X_pos_train, X_cues_train], Y_train, batch_size=batch_size, epochs=num_epochs, verbose=1, validation_split=0.2, callbacks=[early, checkpoint])
In the original TensorFlow code, the author uses masking and softmax cross entropy with logits. I don't see how to implement this in Keras yet. If you have any advice, don't hesitate.
My main issue here is with return_sequences=True. The author doesn't appear to be using it in his TensorFlow implementation, and when I set it to False, I get this error:
ValueError: Error when checking target: expected dense_1 to have 2 dimensions, but got array with shape (820, 109, 2)
I also tried using:
outputs = TimeDistributed(Dense(2, activation='softmax'))(BiLSTM)
which raises an AssertionError without any further information.
Any ideas?
Thanks
Upvotes: 0
Views: 832
Reputation: 6002
the author uses masking and softmax cross entropy with logits. I don't get how to implement this in Keras yet.
Regarding softmax cross entropy with logits, you are doing it correctly. Using softmax_cross_entropy_with_logits as the loss function with no activation on the last layer is equivalent to your approach with categorical_crossentropy as the loss and a softmax activation on the last layer; the only difference is that the latter is numerically less stable. If this turns out to be an issue for you, you can (if your Keras backend is TensorFlow) use tf.nn.softmax_cross_entropy_with_logits as your loss and drop the final softmax. If you use another backend, you will have to look for an equivalent there.
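A minimal sketch of what that could look like, assuming a TensorFlow backend (the wrapper name softmax_ce_with_logits is just an illustration; the wrapper is needed so the TF op matches Keras' (y_true, y_pred) loss signature):
import tensorflow as tf

def softmax_ce_with_logits(y_true, y_pred):
    # y_pred must be raw logits here, i.e. the last Dense layer has no activation
    return tf.nn.softmax_cross_entropy_with_logits(labels=y_true, logits=y_pred)

# outputs = Dense(2)(DPT)   # note: no softmax
# model.compile('adam', loss=softmax_ce_with_logits, metrics=['accuracy'])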
Regarding masking, I'm not sure if I fully understand what the author is doing. However, in Keras the Embedding layer has a mask_zero parameter that you can set to True. In that case all timesteps that have a 0 will be ignored in all further calculations. In your source it is not 0 that is being masked, though, so you would have to adjust the indices accordingly. If that doesn't work, there is the Masking layer in Keras that you can put before your recurrent layer, but I have little experience with that.
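A sketch of both options, reusing the names from your snippet (vocabulary_size, summed, hidden_dims etc. are assumed to be defined as in your code). Note that, as far as I know, CuDNNLSTM does not support masking, so the plain LSTM layer would be needed for the mask to take effect:
from keras.layers import Embedding, Masking, Bidirectional, LSTM

# Variant 1: reserve index 0 for padding and let the embedding produce the mask
w_emb = Embedding(vocabulary_size + 1, embedding_dim, input_length=sequence_length,
                  mask_zero=True,  # timesteps with value 0 are ignored downstream
                  trainable=False)(inputs_w)

# Variant 2: an explicit Masking layer in front of the recurrent layer
# masked = Masking(mask_value=0.0)(summed)
# BiLSTM = Bidirectional(LSTM(hidden_dims, return_sequences=True))(masked)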
My main issue here is with return_sequences=True. The author doesn't appear to be using it
What makes you think he doesn't use it? Just because the keyword does not appear in the code doesn't mean anything. But I'm not sure either; the code is pretty old and I can no longer find the docs that would tell what the defaults were.
Anyway, if you want to use return_sequences=False (for whatever reason), be aware that this changes the output shape of the layer:
- with return_sequences=True the output shape is (batch_size, timesteps, features)
- with return_sequences=False the output shape is (batch_size, features)
The error you are getting is basically telling you that your network's output has one dimension less than the target y values you are feeding it. So, to me it looks like return_sequences=True is just what you need, but without further information it is hard to tell.
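A toy illustration of how the shapes work out (the layer sizes are made up; only the last dimension of the target matters):
from keras.layers import Input, Bidirectional, LSTM, Dense
from keras.models import Model

inp = Input(shape=(109, 8))                                   # (batch, timesteps, features)
seq = Bidirectional(LSTM(16, return_sequences=True))(inp)     # (batch, 109, 32)
out_per_step = Dense(2, activation='softmax')(seq)            # (batch, 109, 2) -> matches y of shape (820, 109, 2)
last = Bidirectional(LSTM(16, return_sequences=False))(inp)   # (batch, 32)
out_single = Dense(2, activation='softmax')(last)             # (batch, 2) -> would need y of shape (820, 2)
Model(inp, [out_per_step, out_single]).summary()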
Then, regarding TimeDistributed. I'm not quite sure what you are trying to achieve with it, but quoting from the docs:
This wrapper applies a layer to every temporal slice of an input.
The input should be at least 3D, and the dimension of index one will be considered to be the temporal dimension.
(emphasis is mine)
I'm not sure from your question in which scenario the empty assertion occurs. If you have a recurrent layer with return_sequences=False before it, you are again missing a dimension (I can't tell you why the assertion is empty, though). If you have a recurrent layer with return_sequences=True before it, it should work, but it would be completely useless, as Dense is applied in a time-distributed way anyway. If I'm not mistaken, this behavior of the Dense layer was changed in some older Keras version (they should really update the example there and stop using Dense!). As the code you are referring to is quite old, it's well possible that TimeDistributed was needed back then but is not needed anymore.
If your plan was to restore the missing dimension, TimeDistributed won't help you, but RepeatVector would. But, as already said, in that case better use return_sequences=True in the first place.
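For completeness, a sketch of what RepeatVector would do, again reusing summed, hidden_dims and sequence_length from your snippet:
from keras.layers import RepeatVector, Bidirectional, CuDNNLSTM

last = Bidirectional(CuDNNLSTM(hidden_dims))(summed)      # (batch, 2 * hidden_dims), last timestep only
repeated = RepeatVector(sequence_length)(last)            # (batch, sequence_length, 2 * hidden_dims)
# every timestep now carries the same vector, which is why return_sequences=True is the better option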
Upvotes: 1
Reputation: 941
The problem is that your target values are time distributed: you have 109 timesteps, each with a one-hot target vector of size two. This is why you need return_sequences=True; otherwise only the last timestep would be fed to the Dense layer and you would get a single output per sequence.
So, depending on what you need, either keep it as it is now, or, if only the last timestep is enough for you, drop return_sequences=True, but then you would need to adjust the y values accordingly.
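A sketch of what that adjustment could look like, assuming Y_train has shape (820, 109, 2) as in your error message:
Y_last = Y_train[:, -1, :]   # keep only the label of the final timestep -> shape (820, 2)

# BiLSTM  = Bidirectional(CuDNNLSTM(hidden_dims, return_sequences=False))(summed)
# outputs = Dense(2, activation='softmax')(BiLSTM)
# model.fit([X_train, X_pos_train, X_cues_train], Y_last, batch_size=batch_size, epochs=num_epochs)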
Upvotes: 1