Keras sequence to sequence model loss increases without bound

Question

I'm building a sequence to sequence model in Keras to correct simple spelling mistakes that might occur. I'm following mostly this tutorial.

I have a pretty involved piece of code to generate random misspellings in words, and then I send the outputs of that: ([misspelled sentence,offset_sentence],original_sentence) into the model. The model I built looks pretty much exactly the same as the one in the tutorial:

    print('Training tokenizer...')
    tokenizer = CharToken()
    tokenizer.train_on_corpus(brown)

    num_chars = len(tokenizer.char_dict)
    alpha = 0.001

    encoder_inputs = Input(shape=(None,num_chars))
    encoder = LSTM(128,return_state=True)
    encoder_outputs,state_h,state_c = encoder(encoder_inputs)

    encoder_states = [state_h,state_c]

    decoder_inputs = Input(shape=(None,num_chars))
    decoder = LSTM(128,return_sequences=True,return_state=True)
    decoder_outputs,_,_ = 
    decoder(decoder_inputs,initial_state=encoder_states)
    decoder_dense = Dense(num_chars,activation='softmax')
    decoder_outputs = decoder_dense(decoder_outputs)

    model = keras.models.Model(inputs= 
    [encoder_inputs,decoder_inputs],outputs=decoder_outputs)
    optim = keras.optimizers.rmsprop(lr=alpha,decay=1e-6)
    model.compile(optimizer=optim,loss='categorical_crossentropy')
    model.summary()
    model.fit_generator(tokenizer.batch_generator(),
    steps_per_epoch=1000,epochs=1)

I'm sure that the problem is not in the tokenizer since I've gone through it and checked all of the outputs multiple times. The batch_generator method in that class outputs a tuple of one hot vectors representing ([misspelled sentence,offset_sentence],original_sentence). I've tried changing the hyperparmaters, including making the learning rate a miniscule 0.00001 but no matter what I do, the training loss always starts at around 11 and then just keep increasing...

Can anybody figure out what I did wrong?

EDIT: I did one more step of debugging where I removed the tokenizer from the equation and just tried to train the network on 3 random one-hot arrays. I reduced the complexity a lot by limiting them to have only 10 possible inputs/outputs (characters). The loss quickly rose to ~100 and stayed there. I expect for 10 possible outcomes that random guessing would get me a loss of around -ln(1/10)~2.3, certainly 100 is way too high. I also expect that even though I fed the network random arrays that it would eventually memorize those arrays and over-fit and the loss would decrease, but that's not the case. The loss stays around 100. I can't figure out what's going wrong...

EDIT 2: Some more debugging. I've beaten the model into a much simpler one by forcing the inputs and outputs to have the same length. This is not too bad for a spell corrector as long as the spelling mistakes aren't deleting or inserting too many characters (I pad the sequences anyways to make them the same length which allows me to be able to train on batches). However, the model still exhibits the same behavior. I've also tried running the model on random numbers and get the same 100ish loss:

    num_chars=10
    alpha = 0.001

    X = np.random.randint(10,size=(10000,30,10))
    Y = np.random.randint(10,size=(10000,30,10))

    inputs = Input(shape=(None,num_chars))
    x = LSTM(128,return_sequences=True)(inputs)
    x = LSTM(128,return_sequences=True)(x)
    output = Dense(num_chars,activation='softmax')(x)

    model = keras.models.Model(inputs=inputs,outputs=output)
    optim = keras.optimizers.rmsprop(lr=alpha,decay=1e-6)
    model.compile(optimizer=optim,loss='categorical_crossentropy')
    model.summary()
    model.fit(X,Y)

I'm beginning to wonder if there's something wrong with my installation of keras or something like that. I've run sequence models like this before many times on my other machine, and I've never observed this strange behavior.

EDIT 3: I just realized my debugging was flawed. I didn't turn the random Y array into a one-hot vector. When I do change it to a one-hot vector, the loss is as expected, about -ln(1/num_chars). This means that the problem is probably in my tokenizer generator. But I can't figure out what the problem is since I've printed out the output and saw that they were indeed one-hot vectors.

Keras sequence to sequence model loss increases without bound

Answers (1)

Related Questions