Reputation: 4991
I am not able to understand why the weights of the following model keep getting smaller and smaller during training until they become NaN.
The model is the following:
import numpy as np
from keras import backend as K
from keras.layers import (Input, Embedding, Reshape, Lambda, Convolution1D,
                          GlobalMaxPooling1D, Activation, Dropout, Dense)
from keras.models import Model
from keras.optimizers import Adam

# load_vector_lookups, load_x_y, _NUMBER_OF_LENGTH and negatives_titles
# are defined elsewhere in my code.

def initialize_embedding_matrix(embedding_matrix):
    # embedding layer initialized with the pre-trained vectors
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer

def get_divisor(x):
    # L2 norm along the last axis
    return K.sqrt(K.sum(K.square(x), axis=-1))

def similarity(a, b):
    # cosine similarity between a and b
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator

def max_margin_loss(positive, negative):
    # hinge loss: max(0, 1 + negative_sim - positive_sim), summed over the negatives
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss

def warp_loss(X):
    z, positive_entity, negatives_entities = X
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(negatives_titles.shape[1].value, 1,),
                         name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]),
                  output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss

def mean_loss(y_true, y_pred):
    # the model output already is the loss, so just average it; y_true is a dummy
    return K.mean(y_pred - 0 * y_true)

def build_nn_model():
    wl, tl = load_vector_lookups()  # pre-trained word and title embedding matrices
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)

    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32', name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")

    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)

    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)
    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4)  # activation="linear"

    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
    )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model

model = build_nn_model()
x_train, y_real_train, y_fake_train = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}
model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10,
          shuffle=True, validation_split=0.1, epochs=10)
To describe the model a bit:

- The pre-trained embedding lookups are loaded as two matrices (wl, tl) and I initialize the Keras embeddings with these values.
- sequence_input_1 has integers as input (indexes of words, ex. [42, 32, .., 4]). On them sequence.pad_sequences(X, maxlen=_NUMBER_OF_LENGTH) is used to have a fixed length.
- sequence_input_positive is an integer index of the positive output, and sequence_input_negatives are N random negative outputs (10 in the code above) for each example.
- The loss is built from cosinus_similarity(positive_example, sequence_input_1) and cosinus_similarity(negative_example[i], sequence_input_1), and the Adam optimizer is used to minimize it (see the small NumPy sketch right after this list).
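To make the loss concrete, here is a minimal NumPy sketch of what one example goes through; the vectors are made-up numbers, not real embeddings, and only two negatives are shown instead of ten:

import numpy as np

def cosine_sim(a, b):
    # same clamp as K.maximum(denominator, K.epsilon()) in the Keras code above
    return np.dot(a, b) / max(np.linalg.norm(a) * np.linalg.norm(b), 1e-7)

z = np.array([0.2, -0.1, 0.4])           # predicted vector for one example
positive = np.array([0.3, -0.2, 0.5])    # embedding of the true (positive) output
negatives = np.array([[0.9, 0.1, -0.3],  # embeddings of the random negatives
                      [-0.4, 0.6, 0.2]])

pos_sim = cosine_sim(z, positive)
neg_sims = np.array([cosine_sim(z, n) for n in negatives])
loss = np.sum(np.maximum(0.0, 1.0 + neg_sims - pos_sim))  # hinge over all negatives
print(loss)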
While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points, the embedding weights go to NaN too. I can observe that as the model runs the weights get smaller and smaller until they become NaN. Something noticeable also is that the loss does not go to NaN: when the weights reach NaN, the loss goes to zero.
I am unable to find what is going wrong.
This is what I tried until now:

- Using the SGD optimizer instead of Adam didn't change the behaviour.
- I checked the input data and the pre-trained embeddings for nan values.
- I normalized the pre-trained embeddings using np.linalg.norm.
- I cast the embedding matrices from float64 to float32.

Do you see anything strange in the architecture of the model? If not: I am unable to find a way to debug the architecture in order to understand why the weights get smaller and smaller until they reach NaN. Are there steps people follow when they notice this kind of behaviour?
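At the moment the closest thing I have to a systematic check is printing the weight norms after every batch, roughly like the sketch below (the layer loop and the LambdaCallback usage are just an illustration):

import numpy as np
from keras.callbacks import LambdaCallback

def check_weights(batch, logs):
    # flag NaNs and print the largest absolute weight of every layer
    for layer in model.layers:
        for w in layer.get_weights():
            if np.isnan(w).any():
                print("NaN weights in layer:", layer.name)
            else:
                print(layer.name, "max |w| =", np.abs(w).max())

weight_monitor = LambdaCallback(on_batch_end=check_weights)
model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10, shuffle=True,
          validation_split=0.1, epochs=10, callbacks=[weight_monitor])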
Edit:
By using trainable=False in the Embeddings, this behaviour of nan weights is NOT observed and the training seems to run smoothly. However, I want the embeddings to be trainable. So why does this behaviour appear when the embeddings are trainable?
Edit2:
Using trainable=True and initializing the weights uniformly at random with embeddings_initializer='uniform', the training is smooth. So the cause is my pre-trained word embeddings. I have checked the pre-trained word embeddings and there are no NaN values. I have also normalized them in case this was causing it, but no luck. I can't think of anything else that would make these specific weights behave this way.
Edit3:
It seems that what causing this was that a lot of rows from one of the Embeddings trained in gensim where all zeros. ex.
[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]
It was not so easy to find because the dimensionality of the embeddings is really big.
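In case it helps someone, a quick NumPy check like the one below (wl and tl being the matrices loaded above) finds such rows right away; the check itself is just a sketch:

import numpy as np

wl, tl = load_vector_lookups()   # the pre-trained embedding matrices from above

# rows that are entirely zero
zero_rows = np.where(~tl.any(axis=1))[0]
print("all-zero rows: %d of %d" % (len(zero_rows), tl.shape[0]), zero_rows)

# leftover NaN/inf values from training or conversion
print("NaN values:", np.isnan(tl).any(), "inf values:", np.isinf(tl).any())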
I am leaving this question open in case someone comes up with something similar or wants to answer the question asked above: "Is there some steps people are using when they notice this kind of behaviour?"
Upvotes: 1
Views: 781
Reputation: 86600
Your edits made the problem a little easier to find.
Those zeros were passed unchanged to the warp_loss function.
The part that went through the convolution remained unchanged at first, because any filter multiplied by zero results in zero, and the default bias initializer is also 'zeros'. The same idea applies to the Dense layer (weights * 0 = 0 and the bias initializer is 'zeros').
That reached this line: return numerator / denominator and caused an error (division by zero).
It's a common practice I've seen in many codebases to add K.epsilon() to the denominator to avoid this:
return numerator / (denominator + K.epsilon())
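Applied to your similarity function, the fix would look roughly like this:

from keras import backend as K

def get_divisor(x):
    return K.sqrt(K.sum(K.square(x), axis=-1))

def similarity(a, b):
    # cosine similarity, guarded against an all-zero a or b
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    return numerator / (denominator + K.epsilon())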
Upvotes: 1