Reputation: 4991
I am not able to understand why the weights of the following model keep getting smaller and smaller during training until they become NaN.
The model is the following:
import numpy as np
from keras import backend as K
from keras.layers import (Input, Embedding, Reshape, Lambda, Convolution1D,
                          GlobalMaxPooling1D, Activation, Dropout, Dense)
from keras.models import Model
from keras.optimizers import Adam

# load_vector_lookups, load_x_y, _NUMBER_OF_LENGTH and negatives_titles
# are defined elsewhere in my code.

def initialize_embedding_matrix(embedding_matrix):
    # embedding layer initialized with the pre-trained vectors
    embedding_layer = Embedding(
        input_dim=embedding_matrix.shape[0],
        output_dim=embedding_matrix.shape[1],
        weights=[embedding_matrix],
        trainable=True)
    return embedding_layer

def get_divisor(x):
    # L2 norm along the last axis
    return K.sqrt(K.sum(K.square(x), axis=-1))

def similarity(a, b):
    # cosine similarity between a and b
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    denominator = K.maximum(denominator, K.epsilon())
    return numerator / denominator

def max_margin_loss(positive, negative):
    # hinge loss: max(0, 1 + negative_sim - positive_sim), summed over the negatives
    loss_matrix = K.maximum(0.0, 1.0 + negative - Reshape((1,))(positive))
    loss = K.sum(loss_matrix, axis=-1, keepdims=True)
    return loss

def warp_loss(X):
    z, positive_entity, negatives_entities = X
    positiveSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(1,), name="positive_sim")([z, positive_entity])
    z_reshaped = Reshape((1, z.shape[1].value))(z)
    negativeSim = Lambda(lambda x: similarity(x[0], x[1]),
                         output_shape=(negatives_titles.shape[1].value, 1,),
                         name="negative_sim")([z_reshaped, negatives_entities])
    loss = Lambda(lambda x: max_margin_loss(x[0], x[1]),
                  output_shape=(1,), name="max_margin")([positiveSim, negativeSim])
    return loss

def mean_loss(y_true, y_pred):
    # the model output already is the loss, so just average it; y_true is a dummy
    return K.mean(y_pred - 0 * y_true)

def build_nn_model():
    wl, tl = load_vector_lookups()  # pre-trained word and title embedding matrices
    embedded_layer_1 = initialize_embedding_matrix(wl)
    embedded_layer_2 = initialize_embedding_matrix(tl)

    sequence_input_1 = Input(shape=(_NUMBER_OF_LENGTH,), dtype='int32', name="text")
    sequence_input_positive = Input(shape=(1,), dtype='int32', name="positive")
    sequence_input_negatives = Input(shape=(10,), dtype='int32', name="negatives")

    embedded_sequences_1 = embedded_layer_1(sequence_input_1)
    embedded_sequences_positive = Reshape((tl.shape[1],))(embedded_layer_2(sequence_input_positive))
    embedded_sequences_negatives = embedded_layer_2(sequence_input_negatives)

    conv_step1 = Convolution1D(
        filters=1000,
        kernel_size=5,
        activation="tanh",
        name="conv_layer_mp",
        padding="valid")(embedded_sequences_1)
    conv_step2 = GlobalMaxPooling1D(name="max_pool_mp")(conv_step1)
    conv_step3 = Activation("tanh")(conv_step2)
    conv_step4 = Dropout(0.2, name="dropout_mp")(conv_step3)
    z = Dense(wl.shape[1], name="predicted_vec")(conv_step4)  # activation="linear"

    loss = warp_loss([z, embedded_sequences_positive, embedded_sequences_negatives])
    model = Model(
        inputs=[sequence_input_1, sequence_input_positive, sequence_input_negatives],
        outputs=[loss]
    )
    model.compile(loss=mean_loss, optimizer=Adam())
    return model

model = build_nn_model()
x_train, y_real_train, y_fake_train = load_x_y()
X_train = {
    'text': x_train,
    'positive': y_real_train,
    'negatives': y_fake_train
}
model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10,
          shuffle=True, validation_split=0.1, epochs=10)
To describe the model a bit:

- The pre-trained embedding lookups are loaded as two matrices (wl, tl) and I initialize the Keras embeddings with these values.
- sequence_input_1 has integers as input (indexes of words, ex. [42, 32, .., 4]). On them sequence.pad_sequences(X, maxlen=_NUMBER_OF_LENGTH) is used to have a fixed length.
- sequence_input_positive is an integer index of the positive output, and sequence_input_negatives are N random negative outputs (10 in the code above) for each example.
- The loss is built from cosinus_similarity(positive_example, sequence_input_1) and cosinus_similarity(negative_example[i], sequence_input_1), and the Adam optimizer is used to minimize it (see the small NumPy sketch right after this list).
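To make the loss concrete, here is a minimal NumPy sketch of what one example goes through; the vectors are made-up numbers, not real embeddings, and only two negatives are shown instead of ten:

import numpy as np

def cosine_sim(a, b):
    # same clamp as K.maximum(denominator, K.epsilon()) in the Keras code above
    return np.dot(a, b) / max(np.linalg.norm(a) * np.linalg.norm(b), 1e-7)

z = np.array([0.2, -0.1, 0.4])           # predicted vector for one example
positive = np.array([0.3, -0.2, 0.5])    # embedding of the true (positive) output
negatives = np.array([[0.9, 0.1, -0.3],  # embeddings of the random negatives
                      [-0.4, 0.6, 0.2]])

pos_sim = cosine_sim(z, positive)
neg_sims = np.array([cosine_sim(z, n) for n in negatives])
loss = np.sum(np.maximum(0.0, 1.0 + neg_sims - pos_sim))  # hinge over all negatives
print(loss)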
While training this model, even with only 20 data points, the weights in the Convolution1D and Dense layers go to NaN. If I add more data points, the embedding weights go to NaN too. I can observe that as the model runs the weights get smaller and smaller until they become NaN. Something noticeable also is that the loss does not go to NaN: when the weights reach NaN, the loss goes to zero.
I am unable to find what is going wrong.
This is what I tried until now:

- Using the SGD optimizer instead of Adam didn't change the behaviour.
- I checked the input data and the pre-trained embeddings for nan values.
- I normalized the pre-trained embeddings using np.linalg.norm.
- I cast the embedding matrices from float64 to float32.

Do you see anything strange in the architecture of the model? If not: I am unable to find a way to debug the architecture in order to understand why the weights get smaller and smaller until they reach NaN. Are there steps people follow when they notice this kind of behaviour?
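At the moment the closest thing I have to a systematic check is printing the weight norms after every batch, roughly like the sketch below (the layer loop and the LambdaCallback usage are just an illustration):

import numpy as np
from keras.callbacks import LambdaCallback

def check_weights(batch, logs):
    # flag NaNs and print the largest absolute weight of every layer
    for layer in model.layers:
        for w in layer.get_weights():
            if np.isnan(w).any():
                print("NaN weights in layer:", layer.name)
            else:
                print(layer.name, "max |w| =", np.abs(w).max())

weight_monitor = LambdaCallback(on_batch_end=check_weights)
model.fit(x=X_train, y=np.ones(len(x_train)), batch_size=10, shuffle=True,
          validation_split=0.1, epochs=10, callbacks=[weight_monitor])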
Edit:
By using trainable=False in the Embeddings, this behaviour of nan weights is NOT observed and the training seems to run smoothly. However, I want the embeddings to be trainable. So why does this behaviour appear when the embeddings are trainable?
Edit2:
Using trainable=True and initializing the weights uniformly at random with embeddings_initializer='uniform', the training is smooth. So the cause is my pre-trained word embeddings. I have checked the pre-trained word embeddings and there are no NaN values. I have also normalized them in case this was causing it, but no luck. I can't think of anything else that would make these specific weights behave this way.
Edit3:
It seems that what causing this was that a lot of rows from one of the Embeddings trained in gensim where all zeros. ex.
[0.2, 0.1, .. 0.3],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.0, 0.0, .. 0.0],
[0.2, 0.1, .. 0.1]
It was not so easy to find because the dimensionality of the embeddings is really big.
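In case it helps someone, a quick NumPy check like the one below (wl and tl being the matrices loaded above) finds such rows right away; the check itself is just a sketch:

import numpy as np

wl, tl = load_vector_lookups()   # the pre-trained embedding matrices from above

# rows that are entirely zero
zero_rows = np.where(~tl.any(axis=1))[0]
print("all-zero rows: %d of %d" % (len(zero_rows), tl.shape[0]), zero_rows)

# leftover NaN/inf values from training or conversion
print("NaN values:", np.isnan(tl).any(), "inf values:", np.isinf(tl).any())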
I am leaving this question open in case someone comes up with something similar or wants to answer the question asked above: "Is there some steps people are using when they notice this kind of behaviour?"
Upvotes: 1
Views: 781
Reputation: 86600
Your edits made the problem a little easier to find.
Those zeros were passed unchanged to the warp_loss function.
The part that went through the convolution remained unchanged at first, because any filter multiplied by zero results in zero, and the default bias initializer is also 'zeros'. The same idea applies to the Dense layer (weights * 0 = 0 and the bias initializer is 'zeros').
That reached this line: return numerator / denominator and caused an error (division by zero).
It's a common practice I've seen in many codebases to add K.epsilon() to the denominator to avoid this:
return numerator / (denominator + K.epsilon())
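Applied to your similarity function, the fix would look roughly like this:

from keras import backend as K

def get_divisor(x):
    return K.sqrt(K.sum(K.square(x), axis=-1))

def similarity(a, b):
    # cosine similarity, guarded against an all-zero a or b
    numerator = K.sum(a * b, axis=-1)
    denominator = get_divisor(a) * get_divisor(b)
    return numerator / (denominator + K.epsilon())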
Upvotes: 1