Reputation: 1620
The loss doesn't approach 0: the model doesn't seem to converge and consistently fails to predict y. I've tried playing with the initializer, activation, and layer sizes. Any insight here would be appreciated.
import tensorflow as tf
import keras

activation = 'relu'
initializer = 'he_uniform'

input_layer = tf.keras.layers.Input(shape=(1,), batch_size=1)
dense_layer = keras.layers.Dense(
    32,
    activation=activation,
    kernel_initializer=initializer
)(input_layer)
dense_layer = keras.layers.Dense(
    32,
    activation=activation,
    kernel_initializer=initializer
)(dense_layer)
predictions = keras.layers.Dense(1)(dense_layer)

model = keras.models.Model(inputs=input_layer, outputs=[predictions])
model.summary()

optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001)

x = tf.constant([[727.], [1424.], [379], [1777], [51.]])
y = tf.constant([[1.], [1.], [0.], [1.], [0.]])

for item in tf.data.Dataset.from_tensor_slices((x, y)).shuffle(5).repeat():
    with tf.GradientTape() as tape:
        x = item[0]
        output = model(x)
        loss = keras.losses.BinaryCrossentropy(
            from_logits=True
        )(item[1], output)
        # loss = item[1] - output[0]
    dy_dx = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(dy_dx, model.trainable_weights))
    print("batch", item[0], "x", "output", output, "expected", item[1], "gradient", dy_dx[-1])
    print("loss", loss)
Upvotes: 1
Views: 339
Reputation: 10475
Your input numbers are huge, which leads to numerical issues, and you are not batching your inputs, so each single-sample step produces a very large gradient (again, due to the large input values) in a possibly different direction. It works fine once I scale the inputs down and add .batch(5) to the dataset definition (in fact, I just replaced shuffle, because every batch then contains the full dataset) to improve the gradient estimates. This should converge very quickly.
import tensorflow as tf
import keras

activation = 'relu'
initializer = 'he_uniform'

input_layer = tf.keras.layers.Input(shape=(1,))
dense_layer = keras.layers.Dense(
    32,
    activation=activation,
    kernel_initializer=initializer
)(input_layer)
dense_layer = keras.layers.Dense(
    32,
    activation=activation,
    kernel_initializer=initializer
)(dense_layer)
predictions = keras.layers.Dense(1)(dense_layer)
model = keras.models.Model(inputs=input_layer, outputs=[predictions])
model.summary()

optimizer = tf.keras.optimizers.Adam(learning_rate=0.1)

x = tf.constant([[727.], [1424.], [379], [1777], [51.]]) / 1000.
y = tf.constant([[1.], [1.], [0.], [1.], [0.]])

for step, item in enumerate(tf.data.Dataset.from_tensor_slices((x, y)).batch(5).repeat()):
    with tf.GradientTape() as tape:
        x = item[0]
        output = model(x)
        loss = keras.losses.BinaryCrossentropy(
            from_logits=True
        )(item[1], output)
        # loss = item[1] - output[0]
    dy_dx = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(dy_dx, model.trainable_weights))
    if not step % 100:
        print("batch", item[0], "x", "output", tf.nn.sigmoid(output), "expected", item[1], "gradient", dy_dx[-1])
        print("loss", loss)
And note: using no activation function on the output layer together with binary cross-entropy "from logits" is correct, so ignore people telling you otherwise.
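To see why, from_logits=True simply tells the loss to apply the sigmoid internally, so passing raw logits is equivalent to applying the sigmoid yourself and using the default loss. A quick check with made-up values (any tiny numerical difference comes from the internal clipping of probabilities):
import tensorflow as tf
import keras

# from_logits=True applies the sigmoid inside the loss, so raw logits give
# the same result as sigmoid(logits) with the default from_logits=False.
# The values below are purely illustrative.
y_true = tf.constant([[1.], [0.]])
logits = tf.constant([[2.0], [-1.0]])  # raw model outputs, no activation

loss_from_logits = keras.losses.BinaryCrossentropy(from_logits=True)(y_true, logits)
loss_from_probs = keras.losses.BinaryCrossentropy()(y_true, tf.nn.sigmoid(logits))
print(loss_from_logits.numpy(), loss_from_probs.numpy())  # ~0.22 for both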
Upvotes: 1
Reputation: 264
Your output layer - predictions - is missing an activation: keras.layers.Dense has a default value of None for the activation parameter. From your code it looks like you are doing binary classification, so your output layer should have a 'sigmoid' activation.
At inference time, be sure to round the output of the model to 0 or 1 to get the class predictions.
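A minimal sketch of that change (layer sizes and input values here are just illustrative, and the loss is kept at its default from_logits=False so the sigmoid is not applied twice):
import tensorflow as tf
import keras

# Sigmoid on the output layer, with the loss left at its default
# from_logits=False so the sigmoid is not applied twice.
inputs = tf.keras.layers.Input(shape=(1,))
hidden = keras.layers.Dense(32, activation='relu')(inputs)
outputs = keras.layers.Dense(1, activation='sigmoid')(hidden)
model = keras.models.Model(inputs=inputs, outputs=outputs)
model.compile(optimizer='adam', loss=keras.losses.BinaryCrossentropy())

# At inference time, round the probabilities to get class labels.
x = tf.constant([[0.727], [1.424], [0.379], [1.777], [0.051]])
probs = model(x)                          # values in (0, 1)
classes = tf.cast(probs > 0.5, tf.int32)  # 0/1 predictions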
Upvotes: 0