Kami

Reputation: 23

Convergence issue with a multi-linear regression problem

I'm trying to solve a multi-linear regression problem with a very simple linear network. The network consists of a single dense output layer with a linear activation. I synthesize the output data Y by multiplying the input data X by the system (weight) matrix A: Y = A.X. Both X and A contain random numbers with normal or uniform distributions (the problem happens regardless). In this case, the network reaches above 99% accuracy in only 7 epochs over 1000 samples, as one would expect.

Now, if instead I synthesize X from Y, which this time contains uniform random numbers, using A's inverse, X = inv(A).Y, and try to train the network, the accuracy only reaches 94% after two hundred epochs.

Why is there such a huge disparity between the two cases, even though the system matrix (weights) is exactly the same? The only difference is the random distribution of X and Y. If I'm forced to follow the second case, how can I improve the trainability of my network so that it can be trained in a few epochs?

I have tried different optimizers, initializers and regularizations but they didn't help.

Here's the code for the version that doesn't converge so well. To get the first version, I pass gen1 instead of gen2 to Dataset.from_generator(...).

import itertools

import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import keras

N = 256
np.random.seed(0)
A = np.random.normal(0, .4, (N, N))   # the system (weight) matrix
Ainv = np.linalg.inv(A)

input_size = N
def gen1():
    # Case 1: draw X uniformly, synthesize Y = A.X
    for i in itertools.count(1):
        X = np.random.rand(N, 1) - .5
        Y = np.dot(A, X)
        yield (X[:, 0], Y[:, 0])

def gen2():
    # Case 2: draw Y uniformly, synthesize X = inv(A).Y
    for i in itertools.count(1):
        Y = np.random.rand(N, 1) - 0.5
        X = np.dot(Ainv, Y)
        yield (X[:, 0], Y[:, 0])

dataset = tf.data.Dataset.from_generator(
     gen2,
     (tf.float64, tf.float64),
     (tf.TensorShape([N]), tf.TensorShape([N])))

train_ds = dataset.take(950)
valid_ds = dataset.skip(950).take(50)

#train_ds = train_ds.shuffle(2000, reshuffle_each_iteration = True)

train_ds = train_ds.batch(1)
valid_ds = valid_ds.batch(1)

from keras.layers import Input, Dense
from keras.models import Model
from keras import backend
 
def rabs(y_t, y_p):
    # mean absolute error as a percentage of the target range
    return backend.mean(backend.abs(y_p - y_t), axis=-1) / (backend.max(y_t) - backend.min(y_t)) * 100

inp = Input(shape=(input_size,))
out = Dense(N, activation='linear')(inp)

autoencoder = Model(inp, out)

#opt = tf.keras.optimizers.Adam(learning_rate=.0001)
opt = tf.keras.optimizers.SGD(learning_rate=.2, momentum=0.7)
autoencoder.compile(optimizer= opt,
              loss=tf.keras.losses.MeanSquaredError(),metrics= [rabs])

autoencoder.summary()

autoen_model = autoencoder.fit(train_ds, validation_data = valid_ds, epochs = 200)

plt.plot(autoen_model.history['rabs'])
plt.plot(autoen_model.history['val_rabs'])
plt.title('Model Accuracy')
plt.ylabel('Relative Absolute Mean Error %')
plt.xlabel('Epoch')
plt.legend(['Training set', 'Validation set'], loc='upper left')
plt.show()

Training graphs

[Figure: Case 1, Y synthesized: training and validation error curves]

[Figure: Case 2, X synthesized: training and validation error curves]

Upvotes: 2

Views: 666

Answers (2)

tmilne

Reputation: 26

Why I think this is happening

I'm going to ignore that you're doing stochastic gradient descent, and just imagine that you're working with the entire dataset for each step. In this case, your problem in both cases is to minimize ||Y-AX||^2 over A.

After doing some algebra, you can write this as a quadratic optimization problem of the form

\min_{z} z^T Q z + b^T z,

where z \in R^{256^2} represents the entries of the matrix A, Q is a symmetric matrix obtained only from X, and b is a vector obtained from X and Y. What you are asking Tensorflow to do is to solve this problem using gradient descent.
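To spell out that algebra (a sketch; here X is the 256-by-n matrix collecting all input samples and a_i denotes the i-th row of A):

||Y - AX||_F^2 = tr(YY^T) - 2 tr(Y^T A X) + tr(A X X^T A^T),

and with z = vec(A^T), i.e. the rows of A stacked into one long vector,

tr(A X X^T A^T) = \sum_i a_i^T (X X^T) a_i = z^T (I_{256} \otimes X X^T) z.

So Q = I_{256} \otimes X X^T, whose eigenvalues are those of X X^T repeated 256 times, b collects the entries of -2 X Y^T, and the constant tr(YY^T) doesn't affect the minimizer.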

The convergence rate of gradient descent on this type of problem is governed by the condition number of Q, which is its largest eigenvalue divided by its smallest. A condition number that is much larger than one leads to slow gradient descent, as some variables update much faster than others. A condition number closer to one is best for obtaining fast convergence. In Guler's Foundations of Optimization (Section 14.2) you can read more about the effect of condition number on convergence of (a variant of) gradient descent, though there are probably better resources on this out there.
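To see the effect concretely, here is a toy run of my own (nothing to do with the poster's setup): plain gradient descent on f(z) = 0.5 z^T Q z in just two variables, once with a well-conditioned Q and once with a badly conditioned one.

import numpy as np

# Gradient descent on f(z) = 0.5 * z^T Q z with step size 1/L,
# where L is the largest eigenvalue of Q; grad f(z) = Q z.
for kappa in (2.0, 2000.0):
    Q = np.diag([1.0, 1.0 / kappa])       # eigenvalues 1 and 1/kappa
    z = np.ones(2)
    step = 1.0                            # = 1 / largest eigenvalue
    iters = 0
    while np.linalg.norm(Q @ z) > 1e-6:   # stop when the gradient is tiny
        z -= step * (Q @ z)
        iters += 1
    print('condition number {:6.0f}: {} iterations'.format(kappa, iters))

The well-conditioned problem stops after a couple dozen steps; the badly conditioned one takes on the order of ten thousand.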

In your case, the eigenvalues of Q are just the eigenvalues of XX^T, which are the squared singular values of X. For the first dataset, X is uniformly distributed, and in the second X= A_0^{-1} Y, where Y is uniformly distributed.
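A quick numerical check of that claim, in case it isn't obvious (my own snippet):

import numpy as np

# The eigenvalues of X X^T are the squared singular values of X.
X = np.random.rand(8, 30)
eig = np.sort(np.linalg.eigvalsh(X @ X.T))
sv = np.sort(np.linalg.svd(X, compute_uv=False))
print(np.allclose(eig, sv ** 2))   # True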

The difference in convergence you are observing comes from the fact that multiplication by A_0^{-1} wildly increases the condition number of your matrix. In the following Python code I ran some random trials of this and found that the condition number of the second matrix is way bigger: thousands of times bigger.

import numpy as np

cond1 = []
cond2 = []


for i in range(10):
    A = np.random.normal(0, 0.4, (256, 256))
    Ainv = np.linalg.inv(A)

    # Case 1: X drawn directly from a uniform distribution
    X1 = np.random.rand(256, 950)
    X1sv = np.linalg.svd(X1, compute_uv=False)

    # Case 2: X obtained by pushing uniform Y through A^{-1}
    Y = np.random.rand(256, 950)
    X2 = np.dot(Ainv, Y)
    X2sv = np.linalg.svd(X2, compute_uv=False)

    # condition number of X X^T = (largest / smallest singular value of X)^2
    cond1.append((X1sv.max() / X1sv.min()) ** 2)
    cond2.append((X2sv.max() / X2sv.min()) ** 2)
cond1 = np.array(cond1)
cond2 = np.array(cond2)

print('X1\'s condition number has mean {:.2e} and std {:.2e}'.format(cond1.mean(), cond1.std()))
print('X2\'s condition number has mean {:.2e} and std {:.2e}'.format(cond2.mean(), cond2.std()))
print('X2\'s mean condition number is {:.1f} times as big as X1\'s'.format(cond2.mean()/cond1.mean()))

So that's my guess as to why you're seeing worse convergence for the second case than the first. I could be wrong, but maybe this will point you in the right direction.

Suggested Solutions

There are a couple of solutions to this:

  1. Use an optimization algorithm like Adam or RMSprop, which makes some effort to improve the conditioning of the problem. You can learn more about those in Chapter 8 of https://www.deeplearningbook.org/.
  2. Do you need A to be a Gaussian matrix? A matrix with singular values closer to 1 would reduce this problem.
  3. There are optimization techniques (nothing to do with machine learning) that ameliorate the difficulties of a large condition number. You might look up preconditioned gradient descent for more information; a rough sketch of the idea follows this list.
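To give a flavor of item 3, here is a minimal sketch (my own, with arbitrary sizes, not a drop-in fix for the poster's code): whiten X so that X X^T becomes a multiple of the identity, fit on the whitened inputs, and fold the whitening matrix back into the learned weights afterwards.

import numpy as np

# Sketch of preconditioning by whitening the inputs.
np.random.seed(0)
N, n = 64, 400
A0 = np.random.normal(0, .4, (N, N))
Y = np.random.rand(N, n) - .5
X = np.linalg.inv(A0) @ Y               # the badly conditioned case

# Build a whitening matrix W from the SVD of X. Then
# (W X)(W X)^T = n * I, i.e. condition number 1.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
W = (U / s) @ U.T * np.sqrt(n)          # U diag(1/s) U^T, rescaled
Xw = W @ X

print(np.linalg.cond(X @ X.T))          # typically enormous
print(np.linalg.cond(Xw @ Xw.T))        # ~1

# Any A_w fitted to Y = A_w @ Xw corresponds to A = A_w @ W on the
# original inputs, so nothing is lost by training on whitened data.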

Upvotes: 1

Mr. For Example

Reputation: 4313

I don't think there is anything wrong with the optimization process; I think the problem is your misleading metric rabs(y_t, y_p).

For the outputs of rabs(y_t, y_p) to be comparable after the MAE is divided by (backend.max(y_t) - backend.min(y_t)), the Y of gen1 and the Y of gen2 would need to follow the same probability distribution, which is not the case here: in gen1 your Y = np.dot(A, np.random.rand(N,1) - 0.5), while in gen2 Y = np.random.rand(N,1) - 0.5.

A simple example: consider y_true_1 = (0.1, 0.2, 0.3), y_true_2 = (0.1, 0.2, 0.5) and y_predict_1 = (0.0, 0.1, 0.2), y_predict_2 = (0.0, 0.1, 0.4). Then MAE_1 = MAE_2 = 0.1, but after dividing MAE_1 by (max(y_true_1) - min(y_true_1)) we get RMAE_1 = 0.5, while dividing MAE_2 by (max(y_true_2) - min(y_true_2)) gives RMAE_2 = 0.25. You can see now why, if the distribution of y_true_1 differs from the distribution of y_true_2, you cannot expect the two outputs of rabs(y_t, y_p) to be the same.
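Checking that arithmetic with a throwaway snippet of mine:

import numpy as np

mae = lambda t, p: np.mean(np.abs(p - t))
rmae = lambda t, p: mae(t, p) / (t.max() - t.min())

y_true_1, y_pred_1 = np.array([0.1, 0.2, 0.3]), np.array([0.0, 0.1, 0.2])
y_true_2, y_pred_2 = np.array([0.1, 0.2, 0.5]), np.array([0.0, 0.1, 0.4])

print(mae(y_true_1, y_pred_1), rmae(y_true_1, y_pred_1))   # ~0.1  ~0.5
print(mae(y_true_2, y_pred_2), rmae(y_true_2, y_pred_2))   # ~0.1  ~0.25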

I changed rabs(y_t, y_p) to plain MAE:

def rabs(y_t, y_p):
    return backend.mean(backend.abs(y_p - y_t))

And the optimizer to:

learning_rate_fn = tf.keras.optimizers.schedules.InverseTimeDecay(1.0, 950 * 100, 9)
opt = tf.keras.optimizers.Adam(learning_rate=learning_rate_fn)

I ran it many times with epochs = 100; the outputs for both gen1() and gen2() are around:

gen1:
Epoch 1/100
950/950 [==============================] - 1s 625us/step - loss: 1631.5898 - rabs: 31.9912 - val_loss: 1568.4200 - val_rabs: 31.6044
Epoch 100/100
950/950 [==============================] - 1s 541us/step - loss: 16.1436 - rabs: 3.1877 - val_loss: 19.1974 - val_rabs: 3.5311

gen2:
Epoch 1/100
950/950 [==============================] - 1s 614us/step - loss: 51.9863 - rabs: 5.7896 - val_loss: 20.9347 - val_rabs: 3.5948
Epoch 100/100
950/950 [==============================] - 1s 540us/step - loss: 0.7340 - rabs: 0.6716 - val_loss: 0.5478 - val_rabs: 0.5920

As you can see, the optimizer basically does the same job in both cases: it reduces the loss (MSE) by a factor of about 100 and rabs (MAE) by a factor of about 10.

Upvotes: 0
