Reputation: 191
I was trying to understand how Keras custom layers work. I am trying to create a multiplication layer that takes a scalar input and multiplies it by a trainable multiplicand. I generate some random data and want to learn the multiplicand. When I train with 10 numbers, it works fine. However, when I train with 20 numbers, the loss just explodes.
from keras import backend as K
from keras.engine.topology import Layer
from keras import initializers

class MultiplicationLayer(Layer):
    def __init__(self, **kwargs):
        super(MultiplicationLayer, self).__init__(**kwargs)

    def build(self, input_shape):
        # Create a trainable weight variable for this layer.
        self.kernel = self.add_weight(name='multiplicand',
                                      shape=(1,),
                                      initializer='glorot_uniform',
                                      trainable=True)
        self.built = True

    def call(self, x):
        return self.kernel * x

    def compute_output_shape(self, input_shape):
        return input_shape
Using TensorFlow backend.
Test model 1 with 10 numbers:
from keras.layers import Input
from keras.models import Model
# input is a single scalar
input = Input(shape=(1,))
multiply = MultiplicationLayer()(input)
model = Model(input, multiply)
model.compile(optimizer='sgd', loss='mse')
import numpy as np
input_data = np.arange(10)
output_data = 2 * input_data
model.fit(input_data, output_data, epochs=10)
#print(model.layers[1].multiplicand.get_value())
print(model.layers[1].get_weights())
Epoch 1/10
10/10 [==============================] - 7s - loss: 257.6145
Epoch 2/10
10/10 [==============================] - 0s - loss: 47.6329
Epoch 3/10
10/10 [==============================] - 0s - loss: 8.8073
Epoch 4/10
10/10 [==============================] - 0s - loss: 1.6285
Epoch 5/10
10/10 [==============================] - 0s - loss: 0.3011
Epoch 6/10
10/10 [==============================] - 0s - loss: 0.0557
Epoch 7/10
10/10 [==============================] - 0s - loss: 0.0103
Epoch 8/10
10/10 [==============================] - 0s - loss: 0.0019
Epoch 9/10
10/10 [==============================] - 0s - loss: 3.5193e-04
Epoch 10/10
10/10 [==============================] - 0s - loss: 6.5076e-05
[array([ 1.99935019], dtype=float32)]
Test model 2 with 20 numbers:
from keras.layers import Input
from keras.models import Model
# input is a single scalar
input = Input(shape=(1,))
multiply = MultiplicationLayer()(input)
model = Model(input, multiply)
model.compile(optimizer='sgd', loss='mse')
import numpy as np
input_data = np.arange(20)
output_data = 2 * input_data
model.fit(input_data, output_data, epochs=10)
#print(model.layers[1].multiplicand.get_value())
print(model.layers[1].get_weights())
Epoch 1/10
20/20 [==============================] - 0s - loss: 278.2014
Epoch 2/10
20/20 [==============================] - 0s - loss: 601.1653
Epoch 3/10
20/20 [==============================] - 0s - loss: 1299.0583
Epoch 4/10
20/20 [==============================] - 0s - loss: 2807.1353
Epoch 5/10
20/20 [==============================] - 0s - loss: 6065.9375
Epoch 6/10
20/20 [==============================] - 0s - loss: 13107.8828
Epoch 7/10
20/20 [==============================] - 0s - loss: 28324.8320
Epoch 8/10
20/20 [==============================] - 0s - loss: 61207.1250
Epoch 9/10
20/20 [==============================] - 0s - loss: 132262.4375
Epoch 10/10
20/20 [==============================] - 0s - loss: 285805.9688
[array([-68.71629333], dtype=float32)]
Any insights into why this might happen?
Upvotes: 2
Views: 111
Reputation: 86600
You can solve this by using another optimizer, such as Adam(lr=0.1) (which unfortunately needs around 100 epochs), or by using a smaller learning rate with SGD, such as SGD(lr=0.001).
from keras.layers import Input
from keras.models import Model
from keras.optimizers import Adam
import numpy as np

# input is a single scalar
inp = Input(shape=(1,))
multiply = MultiplicationLayer()(inp)

model = Model(inp, multiply)
model.compile(optimizer=Adam(lr=0.1), loss='mse')

input_data = np.arange(20)
output_data = 2 * input_data

model.fit(input_data, output_data, epochs=100)
#print(model.layers[1].multiplicand.get_value())
print(model.layers[1].get_weights())
Testing further, I noticed that SGD(lr=0.001) also works, while SGD(lr=0.01) blows up.
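For reference, a minimal sketch of the lower-learning-rate SGD variant, reusing the same MultiplicationLayer; the epoch count here is an assumption, not something I measured:

from keras.layers import Input
from keras.models import Model
from keras.optimizers import SGD
import numpy as np

inp = Input(shape=(1,))
multiply = MultiplicationLayer()(inp)

model = Model(inp, multiply)
# a smaller step size keeps each update from overshooting the optimum
model.compile(optimizer=SGD(lr=0.001), loss='mse')

input_data = np.arange(20)
output_data = 2 * input_data
model.fit(input_data, output_data, epochs=100)  # epoch count is a guess; slower than a larger lr, but stable
print(model.layers[1].get_weights())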
What I suppose is happening:
If the learning rate is large enough that an update overshoots the optimum by a greater distance than it started from, the next step sees an even larger gradient, overshooting again by an even greater distance, and so on.
Example with only one number:
inputNumber = 20
x = currentMultiplicand = 1
targetValue = 40
lr = 0.01
#first step (x=1):
mse = (40-20x)² = 400
gradient = -2*(40-20x)*20 = -800
update = - lr * gradient = 8
new x = 9
#second step (x=9):
mse = (40-20x)² = 19600 #(!!!!!)
gradient = -2*(40-20x)*20 = 5600
update = - lr * gradient = -56
new x = -47
#you can see from here that this is not going to be contained anymore...
The same example, with a lower learning rate:
inputNumber = 20
x = currentMultiplicand = 1
targetValue = 40
lr = 0.001
#first step (x=1):
mse = (40-20x)² = 400
gradient = -2*(40-20x)*20 = -800
update = - lr * gradient = 0.8
new x = 1.8
#second step (x=1.8):
mse = (40-20x)² = 16 #(now this is better)
gradient = -2*(40-20x)*20 = -160
update = - lr * gradient = 0.16 #(decreasing update sizes....)
new x = 1.96
#you can see from here that this is converging...
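To make the pattern concrete, here is a small sketch (plain Python, not Keras) that replays these single-number gradient-descent steps for both learning rates; the step count of 10 is arbitrary:

def run_sgd(lr, steps=10, x=1.0, input_number=20.0, target=40.0):
    # plain gradient descent on mse = (target - input_number * x)**2
    for _ in range(steps):
        gradient = -2 * (target - input_number * x) * input_number
        x = x - lr * gradient
    return x

print(run_sgd(lr=0.01))   # diverges: the error grows by a factor of about -7 every step
print(run_sgd(lr=0.001))  # converges toward the true multiplicand 2.0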
Upvotes: 2