bt3780

Reputation: 13

Can't train a keras model to approximate a simple function

I just got started with machine learning and tried to write a simple program in which a neural network learns the simple function y = f(x) = 2x.

Here's the code:

import numpy as np

# x is a 1D array of the integers 1 to 999
x = np.arange(1, 1000, 1)
y = x*2

xtrain = x[:750]
ytrain = y[:750]
xtest = x[750:]
ytest = y[750:]

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten, Conv2D

model = Sequential()

model.add(Dense(128, input_dim=1, activation='relu'))

model.add(Dense(1, activation='relu'))

model.compile(loss='mean_squared_error', 
          optimizer='sgd', 
          metrics=['accuracy'])

model.summary()

history = model.fit(xtrain, ytrain, 
                batch_size=100, 
                epochs=20, 
                verbose=1, 
                validation_split=0.2)

I get the following output, no matter how I change the architecture or the hyperparameters:

79999/79999 [==============================] - 1s 13us/step - loss: 8533120007.8465 - acc: 0.0000e+00 - val_loss: 32532613324.8000 - val_acc: 0.0000e+00

The accuracy is 0 all the time. What am I doing wrong?

Upvotes: 1

Views: 780

Answers (2)

nuric

Reputation: 11225

It's actually what you would expect if you blindly run gradient descent and expect it to learn any function. The behaviour you observe stems from two reasons:

  1. The derivative that SGD uses to update the weights depends on the input. Take a very simple case y = f(wx + b); by the chain rule, the derivative of y with respect to w is f'(wx + b)*x. So when there is an update for an input that is extremely large / unnormalised, the gradient blows up. The update is basically w' = w - alpha*gradient, so the weight suddenly becomes very small, in fact negative.
  2. After a single gradient update the output becomes negative because SGD overshot. Since you again have ReLU in the final layer, it just outputs 0 and training stalls, because the derivative of ReLU is 0 for negative inputs. (A numerical sketch of this overshoot follows the list.)
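
Not part of the original answer, but a minimal numerical sketch of that overshoot, assuming a single linear neuron pred = w*x with squared loss and an illustrative 0.01 SGD learning rate (all values are made up for illustration):

# Toy version of the problem: one weight, squared loss, plain SGD.
# x is deliberately large and unnormalised, like the arange data.
x, target = 900.0, 1800.0   # target = 2*x
w, lr = 0.05, 0.01          # small initial weight, illustrative SGD learning rate

for step in range(2):
    pred = w * x
    grad = 2.0 * (pred - target) * x   # d/dw of (pred - target)**2
    w = w - lr * grad
    print(step, w)

# Step 0: w overshoots to roughly +3e4; step 1: w crashes to roughly -5e8.
# With a ReLU on top, relu(w*x) is then 0 for every positive x and its
# gradient is 0, so the weight never recovers and training stalls.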

You can reduce the data size to np.arange(1, 10) and reduce the number of hidden neurons to, say, 12 (more neurons make the output even more negative after a single update, as all their weights become negative as well), and you will be able to train the network.
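
For reference, here is a minimal sketch of that reduced setup (an interpretation of the suggestion, not code from the answer; it keeps the question's architecture but with 12 hidden neurons and inputs 1 to 9):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Smaller, well-scaled data as suggested: inputs 1..9
x = np.arange(1, 10)
y = x * 2

# Same layers as in the question, just with fewer hidden neurons
model = Sequential()
model.add(Dense(12, input_dim=1, activation='relu'))
model.add(Dense(1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='sgd')

history = model.fit(x, y, epochs=200, verbose=0)

# The loss should now decrease steadily instead of exploding
print(history.history['loss'][0], history.history['loss'][-1])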

Upvotes: 1

Kris

Reputation: 528

I think it works; check this out. I used randn instead of arange. Other things are pretty much the same.

import numpy as np
from keras.models import Sequential
from keras.layers import Dense
from keras import optimizers

x = np.random.randn(1000)
y = x*2

xtrain = x[0:750]
ytrain = y[0:750]


model = Sequential()

model.add(Dense(128, input_dim=1, activation='relu'))
model.add(Dense(1))
model.summary()
sgd = optimizers.SGD(lr=0.01, decay=1e-6)
model.compile(loss='mean_squared_error', 
          optimizer=sgd, 
          metrics=['mae'])

history = model.fit(xtrain, ytrain,
                batch_size=100, 
                epochs=20, 
                verbose=1, 
                validation_split=0.2)

If you want to use the earlier dataset (i.e. arange), here is the accompanying code for that.

x = np.arange(1,1000, 1)
y = x*2

xtrain = x[0:750]
ytrain = y[0:750]


model = Sequential()

model.add(Dense(128, input_dim=1, activation='relu'))
model.add(Dense(1))
model.summary()
adam = optimizers.Adam(lr=0.0001)
model.compile(loss='mean_squared_error', 
          optimizer=adam, 
          metrics=['mae'])

history = model.fit(xtrain, ytrain,
                batch_size=100, 
                epochs=200, 
                verbose=1, 
                validation_split=0.2)
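
Not in the original answer, but a short check you could run afterwards on the held-out slice from the question (xtest/ytest as defined there):

xtest = x[750:]
ytest = y[750:]

# Returns [mse, mae] on unseen inputs, plus a single prediction
print(model.evaluate(xtest, ytest, verbose=0))
print(model.predict(np.array([[800]])))  # roughly 2*800 = 1600 if training has converged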

Upvotes: 0
