scribbles

Reputation: 4339

Keras autoencoder with pretrained embeddings returning incorrect number of dimensions

I have been attempting to replicate a sentence autoencoder loosely based on an example from the Deep Learning with Keras book.

I recoded the example to use an embedding layer instead of the sentence generator, and to use fit instead of fit_generator.

My code is as follows:

# imports assumed (standalone Keras)
import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Input, Embedding, LSTM, Bidirectional, RepeatVector
from keras.models import Model

# df is a pandas DataFrame holding the raw sentences
df_train_text = df['string']

max_length = 80
embedding_dim = 300
latent_dim = 512
batch_size = 64
num_epochs = 10

# prepare tokenizer
t = Tokenizer(filters='')
t.fit_on_texts(df_train_text)
word_index = t.word_index
vocab_size = len(t.word_index) + 1

# integer encode the documents
encoded_train_text = t.texts_to_matrix(df_train_text)

padded_train_text = pad_sequences(encoded_train_text, maxlen=max_length, padding='post')

padding_train_text = np.asarray(padded_train_text, dtype='int32')

embeddings_index = {}
f = open('/Users/embedding_file.txt')
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))
#Found 51328 word vectors.

embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector


embedding_layer = Embedding(vocab_size,
                            embedding_dim,
                            weights=[embedding_matrix],
                            input_length=max_length,
                            trainable=False)
inputs = Input(shape=(max_length,), name="input")
embedding_layer = embedding_layer(inputs)
encoder = Bidirectional(LSTM(latent_dim), name="encoder_lstm", merge_mode="sum")(embedding_layer)
decoder = RepeatVector(max_length)(encoder)
decoder = Bidirectional(LSTM(embedding_dim, name='decoder_lstm', return_sequences=True), merge_mode="sum")(decoder)
autoencoder = Model(inputs, decoder)
autoencoder.compile(optimizer="adam", loss="mse")


autoencoder.fit(padded_train_text, padded_train_text,
                epochs=num_epochs, 
                batch_size=batch_size,
                callbacks=[checkpoint])

I verified that my layer shapes are the same as those in the example; however, when I try to fit my autoencoder, I get the following error:

ValueError: Error when checking target: expected bidirectional_1 to have 3 dimensions, but got array with shape (36320, 80)

A few other things I tried included switching texts_to_matrix to texts_to_sequences and wrapping/not wrapping my padded strings.

I also came across this post, which seems to indicate that I am going about this the wrong way. Is it possible to fit an autoencoder with the embedding layer as I have coded it? If not, can someone explain the fundamental difference between what is going on in the provided example and in my version?

EDIT: I removed the return_sequences=True argument from the last layer and got the following error: ValueError: Error when checking target: expected bidirectional_1 to have shape (300,) but got array with shape (80,)

After updating, my layer shapes are:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input (InputLayer)           (None, 80)                0         
_________________________________________________________________
embedding_8 (Embedding)      (None, 80, 300)           2440200   
_________________________________________________________________
encoder_lstm (Bidirectional) (None, 512)               3330048   
_________________________________________________________________
repeat_vector_8 (RepeatVecto (None, 80, 512)           0         
_________________________________________________________________
bidirectional_8 (Bidirection (None, 300)               1951200   
=================================================================
Total params: 7,721,448
Trainable params: 5,281,248
Non-trainable params: 2,440,200
_________________________________________________________________

Am I missing a step between the RepeatVector layer and the last layer of the model so that I can return a shape of (None, 80, 300) rather than the (None, 300) shape it is currently generating?

Upvotes: 0

Views: 875

Answers (1)

today

Reputation: 33420

The Embedding layer takes as input a sequence of integers (i.e. word indices) with a shape of (num_words,) and gives the corresponding embeddings as output with a shape of (num_words, embd_dim). So after fitting the Tokenizer instance on the given texts, you need to use its texts_to_sequences() method to transform each text into a sequence of integers:

encoded_train_text = t.texts_to_sequences(df_train_text)
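
As a minimal illustration (a toy corpus and standalone Keras imports are assumed here; the variable names are only for demonstration), this is the shape you end up with after padding the integer sequences:

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

# toy texts just to show the shapes involved
texts = ["the cat sat", "the cat sat on the mat"]
max_length = 8

t = Tokenizer(filters='')
t.fit_on_texts(texts)

# texts_to_sequences() returns a list of lists of word indices
encoded = t.texts_to_sequences(texts)

# pad_sequences() turns that into a 2D integer array
padded = pad_sequences(encoded, maxlen=max_length, padding='post')
print(padded.shape)   # (2, 8)  i.e. (num_samples, max_length)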

Further, after padding, encoded_train_text will have a shape of (num_samples, max_length). Since we are building an autoencoder, the output of the network must have that same shape, and therefore you need to remove the return_sequences=True argument of the last layer. Otherwise, it would give us a 3D tensor as output, which does not make sense; the sketch below shows the shape difference.
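
Here is a minimal sketch of that shape difference (layer sizes are borrowed from the question, but this is not the full model; standalone Keras imports are assumed):

from keras.layers import Input, LSTM, Bidirectional, RepeatVector
from keras.models import Model

latent = Input(shape=(512,))
repeated = RepeatVector(80)(latent)   # (None, 80, 512)

# with return_sequences=True the Bidirectional LSTM emits one vector per timestep
seq_out = Bidirectional(LSTM(300, return_sequences=True), merge_mode="sum")(repeated)
# without it, only the output of the last timestep is returned
vec_out = Bidirectional(LSTM(300), merge_mode="sum")(repeated)

print(Model(latent, seq_out).output_shape)   # (None, 80, 300) -> 3D
print(Model(latent, vec_out).output_shape)   # (None, 300)     -> 2D, same rank as the 2D target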

As a side note, the following line is redundant, since padded_train_text is already a NumPy array (and, by the way, you have not used padding_train_text at all):

padding_train_text = np.asarray(padded_train_text, dtype='int32')
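
For reference, a quick check (again assuming standalone Keras) shows that pad_sequences already returns an int32 NumPy array, so the extra asarray call adds nothing:

from keras.preprocessing.sequence import pad_sequences

padded = pad_sequences([[1, 2, 3], [4, 5]], maxlen=4, padding='post')
print(type(padded))   # <class 'numpy.ndarray'>
print(padded.dtype)   # int32 (the default dtype of pad_sequences)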

Upvotes: 2
