Reputation: 33
I was trying to implement linear regression in Keras/TensorFlow and was very surprised at how difficult it is. The standard examples work great on random data. However, if we change the input data a little bit, all of the examples stop working correctly.
I am trying to find the coefficients for y = 0.5 * x1 + 0.5 * x2.
import numpy as np
from sklearn import preprocessing
from tensorflow import keras

np.random.seed(1443)
n = 100000
x = np.zeros((n, 2))
y = np.zeros((n, 1))
# Two scaled Poisson features, each sorted independently
x[:, 0] = sorted(preprocessing.scale(np.random.poisson(1000000, (n))))
x[:, 1] = sorted(preprocessing.scale(np.random.poisson(1000000, (n))))
y = (x[:, 0] + x[:, 1]) / 2

model = keras.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), dtype="float32"))
model.compile(loss='mean_squared_error', optimizer='sgd')
model.fit(x, y, epochs=1000, batch_size=64)
print(model.get_weights())
The results:
| epochs | batch_size | bias      | x1         | x2         |
|--------|------------|-----------|------------|------------|
| 1000   | 64         | -5.83E-05 | 0.90410435 | 0.09594361 |
| 1000   | 1024       | -5.71E-06 | 0.98739249 | 0.01258729 |
| 1000   | 10000      | -3.07E-07 | -0.2441376 | 1.2441349  |
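For reference, the data itself admits an exact least-squares solution at [0.5, 0.5] (y is constructed as exactly (x1 + x2) / 2). This is a small sanity check I added, not part of the original post, using NumPy's closed-form solver on the same x and y:

import numpy as np

# Sanity check (not in the original post): the closed-form least-squares fit
# on the same data recovers the intended coefficients.
X_design = np.column_stack([np.ones(n), x])          # prepend an intercept column
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)  # [bias, w1, w2]
print(coef)                                          # expected to be close to [0.0, 0.5, 0.5]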
My first thought was that this was a bug in Keras, so I tried the R/TensorFlow library:
library(tensorflow)

floatType <- "float32"
p <- 2L

# Linear model: Y_hat = X %*% W + b
X <- tf$placeholder(floatType, shape = shape(NULL, p), name = "x-data")
Y <- tf$placeholder(floatType, name = "y-data")
W <- tf$Variable(tf$zeros(list(p, 1L), dtype = floatType))
b <- tf$Variable(tf$zeros(list(1L), dtype = floatType))
Y_hat <- tf$add(tf$matmul(X, W), b)

cost <- tf$reduce_mean(tf$square(Y_hat - Y))
generator <- tf$train$GradientDescentOptimizer(learning_rate = 0.01)
optimizer <- generator$minimize(cost)

session <- tf$Session()
session$run(tf$global_variables_initializer())

# Same data as in the Python version: two sorted, scaled Poisson features
set.seed(1443)
n <- 10^5
x <- matrix(replicate(p, sort(scale(rpois(n, 10^6)))), nrow = n)
y <- matrix((x[, 1] + x[, 2]) / 2)

# Mini-batch gradient descent
i <- 1
batch_size <- 10000
epoch_number <- 1000
iterationNumber <- n * epoch_number / batch_size
while (iterationNumber > 0) {
  feed_dict <- dict(X = x[i:(i + batch_size - 1), , drop = F],
                    Y = y[i:(i + batch_size - 1), , drop = F])
  session$run(optimizer, feed_dict = feed_dict)
  i <- i + batch_size
  if (i > n - batch_size)
    i <- i %% batch_size
  iterationNumber <- iterationNumber - 1
}

# Compare against R's closed-form linear regression
r_model <- lm(y ~ x)
tf_coef <- c(session$run(b), session$run(W))
r_coef <- r_model$coefficients
print(rbind(tf_coef, r_coef))
The results:
| epochs | batch_size | bias      | x1        | x2        |
|--------|------------|-----------|-----------|-----------|
| 2000   | 64         | -1.33E-06 | 0.500307  | 0.4996932 |
| 1000   | 1000       | 2.79E-08  | 0.5000809 | 0.499919  |
| 1000   | 10000      | -4.33E-07 | 0.5004921 | 0.499507  |
| 1000   | 100000     | 2.96E-18  | 0.5       | 0.5       |
TensorFlow finds the correct result only when the batch size equals the number of samples (batch_size = n) and the optimizer is plain SGD. With the "adam" or "adagrad" optimizers the errors were much larger.
Could you recommend any approaches to solve this problem with a precision of 1E-07 for Keras or TensorFlow?
Comment 1 (based on the answer below): shuffling the training dataset significantly improves the results of the TensorFlow version:
shuffledIndex <- sample(1:nrow(x))
x <- x[shuffledIndex, ]
y <- y[shuffledIndex, , drop = FALSE]
For batch size = 2000:
| (Intercept)   | x1        | x2        |
|---------------|-----------|-----------|
| -1.130693e-09 | 0.5000004 | 0.4999989 |
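A minimal Python equivalent of that shuffle (my own sketch, not from the original post), useful when feeding the mini-batches manually instead of relying on Keras' built-in shuffling in fit:

import numpy as np

# Shuffle x and y with the same permutation so the (x, y) pairs stay aligned.
perm = np.random.permutation(len(x))
x_shuffled = x[perm]
y_shuffled = y[perm]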
Upvotes: 3
Views: 645
Reputation: 33450
The problem is that you are sorting the generated random numbers for each feature. As a result, the two features end up very close to each other:
>>> np.mean(np.abs(x[:,0]-x[:,1]))
0.004125721684553685
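Another way to see how close the two sorted features are (my own check, not in the original answer) is their correlation:

import numpy as np

# After sorting, the two features are almost perfectly correlated, i.e. nearly collinear.
print(np.corrcoef(x[:, 0], x[:, 1])[0, 1])  # expected to be extremely close to 1.0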
As a result, since x1 ≈ x2, we would have:
y = (x1 + x2) / 2
~= (x1 + x1) / 2
= x1
= 0.5 * x1 + 0.5 * x1
= 0.3 * x1 + 0.7 * x1
= -0.3 * x1 + 1.3 * x1
= 10.1 * x1 - 9.1 * x1
= thousands of other possible combinations
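You can confirm numerically that these combinations are almost indistinguishable on this data. Here is a small illustration I added (not in the original answer) computing the MSE of y ≈ w1 * x1 + w2 * x2 for a few of the coefficient pairs listed above:

import numpy as np

# Because x1 ~= x2, many (w1, w2) pairs with w1 + w2 = 1 fit y with an MSE that is
# small compared to the variance of y (about 1 after scaling); the more extreme the
# pair, the larger the residual coming from the x1 - x2 mismatch.
for w1, w2 in [(0.5, 0.5), (0.3, 0.7), (-0.3, 1.3), (10.1, -9.1)]:
    y_hat = w1 * x[:, 0] + w2 * x[:, 1]
    print(w1, w2, np.mean((y_hat - y) ** 2))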
In this case, the solution that Keras converges to really depends on the initial values of the weights and the bias of the Dense layer. With different initial values you get different results (and for some of them it may not converge at all):
# set the initial weight of Dense layer
model.layers[0].set_weights([np.array([[0], [1]]), np.array([0])])
# fit the model ...
# the final weights
model.get_weights()
[array([[0.00203656],
[0.9981099 ]], dtype=float32),
array([4.5520876e-05], dtype=float32)] # because: y = 0 * x1 + 1 * x1 = x1 ~= (x1 + x2) / 2
# again set the weights to something different
model.layers[0].set_weights([np.array([[0], [0]]), np.array([1])])
# fit the model...
# the final weights
model.get_weights()
[array([[0.49986306],
[0.50013727]], dtype=float32),
array([1.4176634e-08], dtype=float32)] # the one you were looking for!
However, if you don't sort the features (i.e. just remove sorted), it is very likely that the converged weights will be very close to [0.5, 0.5].
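For completeness, here is a minimal sketch of that unsorted variant (my own rewrite of the question's code, not from the original answer); fewer epochs are used since convergence is fast once the features are no longer nearly collinear:

import numpy as np
from sklearn import preprocessing
from tensorflow import keras

np.random.seed(1443)
n = 100000
x = np.zeros((n, 2))
# Same data generation as in the question, but without sorting,
# so x[:, 0] and x[:, 1] are no longer nearly identical.
x[:, 0] = preprocessing.scale(np.random.poisson(1000000, n))
x[:, 1] = preprocessing.scale(np.random.poisson(1000000, n))
y = (x[:, 0] + x[:, 1]) / 2

model = keras.Sequential()
model.add(keras.layers.Dense(1, input_shape=(2,), dtype="float32"))
model.compile(loss="mean_squared_error", optimizer="sgd")
model.fit(x, y, epochs=10, batch_size=64)
print(model.get_weights())  # weights expected to end up very close to [[0.5], [0.5]]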
Upvotes: 3