Mpizos Dimitris

Reputation: 4991

Weights of CNN model go to really small values and then NaN

I cannot understand why the weights of the following model keep getting smaller and smaller during training until they become NaN.

The model is the following:

import numpy as np
from keras import backend as K
from keras.layers import (Activation, Convolution1D, Dense, Dropout, Embedding,
                          GlobalMaxPooling1D, Input, Lambda, Reshape)
from keras.models import Model
from keras.optimizers import Adam


def initialize_embedding_matrix(embedding_matrix):
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer

def get_divisor(x):
    return K.sqrt(K.sum(K.square(x), axis=-1))


def similarity(a, b):
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator


def max_margin_loss(positive, negative):
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss


def warp_loss(X):
    z, positive_entity, negatives_entities = X
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]), output_shape=(negatives_titles.shape[1].value, 1,), name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]), output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss

def mean_loss(y_true, y_pred):
    return K.mean(y_pred - 0 * y_true)

def build_nn_model():
    wl, tl = load_vector_lookups()
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)

    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32',name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")

    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)

    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)

    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4) # activation="linear"

    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
        )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model

model = build_nn_model()
x_train, y_real_train, y_fake_train = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}

model.fit(x=X_train,  y=np.ones(len(x_train)), batch_size=10, shuffle=True, validation_split=0.1, epochs=10)

To describe the model a bit: it embeds a text sequence, passes it through a Convolution1D + GlobalMaxPooling1D + Dense stack to produce a vector z, computes the cosine similarity between z and the embeddings of one positive entity and of 10 negative entities, and minimizes a max-margin (WARP-style) loss over those similarities.

While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points the embedding weights go to NaN too. I can observe that as training runs the weights get smaller and smaller until they become NaN. What is also noticeable is that the loss itself does not become NaN: when the weights reach NaN, the loss goes to zero.

I am unable to find what is going wrong.

This is what I have tried so far:

Do you see anything strange in the architecture of the model? If not: I cannot find a way to debug the architecture in order to understand why the weights keep shrinking until they reach NaN. Are there steps people use when they notice this kind of behaviour?
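One generic debugging step is to log the weight norms after every batch and stop training as soon as a non-finite value shows up. A minimal sketch, assuming the standard Keras callback API (the WeightMonitor name is just illustrative):

import numpy as np
from keras.callbacks import Callback

class WeightMonitor(Callback):
    """Watch weight norms each batch; stop on the first NaN/Inf."""
    def on_batch_end(self, batch, logs=None):
        for layer in self.model.layers:
            for w in layer.get_weights():
                norm = np.linalg.norm(w)
                if not np.isfinite(norm):
                    print("non-finite weights in layer '%s' at batch %d" % (layer.name, batch))
                    self.model.stop_training = True
                    return

# used as: model.fit(..., callbacks=[WeightMonitor()])

Keras' built-in TerminateOnNaN callback only watches the loss, which in this case stays finite, so a weight-level check like this is needed to catch the problem early.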

Edit:

With trainable=False in the Embedding layers this NaN-weight behaviour is NOT observed and training runs smoothly. However, I want the embeddings to be trainable. So why does this behaviour appear when the embeddings are trainable?

Edit2:

Using trainable=True and initializing the weights uniformly at random with embeddings_initializer='uniform', training is smooth. So the cause is my pre-trained word embeddings. I have checked the pre-trained word embeddings and there are no NaN values. I also normalized them in case that was the problem, but no luck. I can't think of anything else about these specific weights that would cause this behaviour.
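For clarity, that variant just replaces the pre-trained weights with a uniform initializer in the layer factory shown above. A sketch (the function name is illustrative; the shapes are kept only so the vocabulary size and dimensionality stay the same):

def initialize_uniform_embedding_matrix(embedding_matrix):
    # same shapes as the pre-trained matrix, but randomly initialized
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        embeddings_initializer='uniform',
        trainable=True)
    return embedding_layer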

Edit3:

It seems that what was causing this is that many rows of one of the embedding matrices trained with gensim were all zeros, e.g.:

[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]

It was not easy to find, as the dimensions of the embeddings were really big.
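For anyone hitting the same thing, here is a quick sketch of the check that would have caught this, assuming embedding_matrix is the pre-trained matrix passed to initialize_embedding_matrix:

import numpy as np

row_norms = np.linalg.norm(embedding_matrix, axis=1)
zero_rows = np.where(row_norms == 0)[0]
print("%d all-zero rows out of %d" % (len(zero_rows), embedding_matrix.shape[0]))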

I am leaving this question open in case someone comes up with something similar or wants to answer the question asked above: "Are there steps people use when they notice this kind of behaviour?"

Upvotes: 1

Views: 781

Answers (1)

Daniel Möller

Reputation: 86600

Your edits made it a little easier to find the problem.

Those zero rows passed unchanged to the warp_loss function. The part that went through the convolution also stayed zero at first, because any filter multiplied by zero gives zero and the default bias initializer is 'zeros'. The same applies to the Dense layer (weights * 0 = 0, and its bias initializer is also 'zeros').

That reached this line: return numerator / denominator and caused a division by zero.

It's a common practice I've seen in many codebases to add K.epsilon() to avoid this:

return numerator / (denominator + K.epsilon())
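Applied to the similarity helper from the question, the guarded version would look roughly like this:

def similarity(a, b):
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    # epsilon keeps the division finite when either vector is all zeros
    return numerator / (denominator + K.epsilon())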

Upvotes: 1
