Spenhouet

Reputation: 7159

How to get bias and neuron weights in optimizer?

In a TensorFlow optimizer (Python), the method _apply_dense gets called once for the neuron weights (layer connections) and once for the bias weights, but I would like to use both within a single call of this method.

def _apply_dense(self, grad, weight):
    ...

For example: a fully connected neural network with two hidden layers, each with two neurons and a bias.

[Image: neural network example]

If we take a look at layer 2, _apply_dense gets one call for the neuron weights:

[Image: neuron weights]

and a call for the bias weights:

[Image: bias weights]

But I would need either both matrices in one call of _apply_dense, or a weight matrix like this:

[Image: all weights from one layer]

X_2X_4, B_1X_4, ... is just notation for the weight of the connection between two neurons. B_1X_4, for example, is only a placeholder for the weight between B_1 and X_4.

How to do this?

MWE

For a minimal working example, here is a stochastic gradient descent optimizer implementation with momentum. For every layer, the momentum of all incoming connections from other neurons is reduced to its mean (see ndims == 2). What I need instead is the mean not only of the momentum values from the incoming neuron connections but also of those from the incoming bias connections (as described above).

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
from tensorflow.python.training import optimizer


class SGDmomentum(optimizer.Optimizer):
    def __init__(self, learning_rate=0.001, mu=0.9, use_locking=False, name="SGDmomentum"):
        super(SGDmomentum, self).__init__(use_locking, name)
        self._lr = learning_rate
        self._mu = mu

        self._lr_t = None
        self._mu_t = None

    def _create_slots(self, var_list):
        # One momentum accumulator slot ("a") per variable.
        for v in var_list:
            self._zeros_slot(v, "a", self._name)

    def _apply_dense(self, grad, weight):
        learning_rate_t = tf.cast(self._lr_t, weight.dtype.base_dtype)
        mu_t = tf.cast(self._mu_t, weight.dtype.base_dtype)
        momentum = self.get_slot(weight, "a")

        if momentum.get_shape().ndims == 2:  # neuron weights: average over incoming connections
            momentum_mean = tf.reduce_mean(momentum, axis=1, keep_dims=True)
        else:  # bias weights (ndims == 1): nothing to average
            momentum_mean = momentum

        momentum_update = grad + (mu_t * momentum_mean)
        momentum_t = tf.assign(momentum, momentum_update, use_locking=self._use_locking)

        weight_update = learning_rate_t * momentum_t
        weight_t = tf.assign_sub(weight, weight_update, use_locking=self._use_locking)

        return tf.group(*[weight_t, momentum_t])

    def _prepare(self):
        self._lr_t = tf.convert_to_tensor(self._lr, name="learning_rate")
        self._mu_t = tf.convert_to_tensor(self._mu, name="momentum_term")

For a simple neural network: https://raw.githubusercontent.com/aymericdamien/TensorFlow-Examples/master/examples/3_NeuralNetworks/multilayer_perceptron.py (only change the optimizer to the custom SGDmomentum optimizer)
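A hedged usage sketch (assuming loss_op is the cost tensor defined in the linked script; the actual variable name there may differ):

# In the linked script, replace the optimizer line with the custom class:
train_op = SGDmomentum(learning_rate=0.001, mu=0.9).minimize(loss_op)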

Upvotes: 3

Views: 1238

Answers (1)

javidcf

Reputation: 59701

Update: Now that I have some understanding of your goal, I'll try to give a better answer (or at least some ideas), but, as you suggest in the comments, there is probably no infallible way of doing this in TensorFlow.

Since TF is a general computation framework, there is no good way of determining which pairs of weights and biases exist in a model (or whether it is a neural network at all). Here are some possible approaches to the problem that I can think of:

  • Annotating the tensors. This is probably not practical, since you already said you have no control over the model, but an easy option would be to add extra attributes to the tensors to signify the weight/bias relationship. For example, you could do something like W.bias = B and B.weight = W, and then in _apply_dense check hasattr(weight, "bias") and hasattr(weight, "weight") (there may be better designs in this sense); see the first sketch after this list.
  • You could look into some framework built on top of TensorFlow where you may have better information about the model structure. For example, Keras is a layer-based framework that implements its own optimizer classes (based on TensorFlow or Theano). I'm not too familiar with the code or its extensibility, but you probably have more tools to work with there.
  • Detect the structure of the network yourself from within the optimizer. This is quite complicated, but theoretically possible. From the loss tensor passed to the optimizer, it should be possible to "climb up" the model graph to reach all of its nodes (taking the .op of the tensors and the .inputs of the ops). You could detect tensor multiplications and additions with variables and skip everything else (activations, loss computation, etc.) to determine the structure of the network; if the model does not match your expectations (e.g. there are no multiplications, or there is a multiplication without a later addition), you could raise an exception indicating that your optimizer cannot be used with that model. A rough sketch of this traversal is given after this list.
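For the first idea, a minimal sketch of what the annotation could look like (the attribute names bias/weight are just Python-side bookkeeping, not a TensorFlow API, and the variables are illustrative):

import tensorflow as tf

# Model-building time: pair each weight matrix with its bias by attaching
# plain Python attributes to the variable objects (nothing is added to the
# TensorFlow graph itself).
W = tf.Variable(tf.random_normal([2, 2]), name="W2")
B = tf.Variable(tf.zeros([2]), name="B2")
W.bias = B
B.weight = W

# Later, e.g. inside _apply_dense, the pairing can be recovered:
def describe(var):
    if hasattr(var, "bias"):
        return "weight matrix, paired bias: %s" % var.bias.name
    if hasattr(var, "weight"):
        return "bias, paired weights: %s" % var.weight.name
    return "unpaired variable"

print(describe(W))  # weight matrix, paired bias: B2:0
print(describe(B))  # bias, paired weights: W2:0

For the third idea, a very rough sketch of the graph traversal (it assumes a plain matmul-plus-add feed-forward graph; "MatMul", "Add", "AddV2" and "BiasAdd" are real TensorFlow op type names, but a real model would need more cases and sanity checks):

def find_weight_bias_pairs(loss):
    """Climb from the loss tensor through op.inputs and collect the inputs
    of MatMul ops that feed directly into an Add/BiasAdd op."""
    pairs = []
    visited = set()
    stack = [loss.op]
    while stack:
        op = stack.pop()
        if op in visited:
            continue
        visited.add(op)
        if op.type in ("Add", "AddV2", "BiasAdd"):
            matmul, bias = None, None
            for inp in op.inputs:
                if inp.op.type == "MatMul":
                    matmul = inp
                else:
                    bias = inp
            if matmul is not None and bias is not None:
                # Which MatMul input holds the weights depends on the model's
                # convention (W . X vs. X . W), so both inputs are kept here.
                pairs.append((tuple(matmul.op.inputs), bias))
        stack.extend(inp.op for inp in op.inputs)
    return pairs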

Old answer, kept for the sake of completeness.

I'm not 100% clear on what you are trying to do, so I'm not sure if this really answers your question.

Let's say you have a dense layer transforming an input of size M into an output of size N. Following the convention you show, you'd have an N × M weights matrix W and an N-sized bias vector B. An input vector X of size M (or a batch of inputs of size M × K) would then be processed by the layer as W · X + B, followed by the activation function (in the batch case, the addition is a broadcast operation). In TensorFlow:

X = ...  # Input batch of size M x K
W = ...  # Weights of size N x M
B = ...  # Biases of size N

Y = tf.matmul(W, X) + B[:, tf.newaxis]  # Output of size N x K
# Activation...

If you want, you can always put W and B together in a single extended weights matrix W*, basically adding B as a new column to W, so W* would be N × (M + 1). Then you just need to append a new element to the input vector X containing a constant 1 (or a new row of ones if it's a batch), so you would get X* with size M + 1 (or (M + 1) × K for a batch). The product W* · X* then gives you the same result as before. In TensorFlow:

X = ...  # Input batch of size M x K
W_star = ...  # Extended weights of size N x (M + 1)
# You can still have a "view" of the original W and B if you need it
W = W_star[:, :-1]  # N x M
B = W_star[:, -1]   # N

X_star = tf.concat([X, tf.ones_like(X[:1])], axis=0)  # (M + 1) x K
Y = tf.matmul(W_star, X_star)  # Output of size N x K
# Activation...

Now you can compute gradients and updates for the weights and biases together. A drawback of this approach is that, if you want to apply regularization, you should be careful to apply it only to the weights part of the matrix, not to the biases.
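For example, a small sketch of restricting an L2 penalty to the weights part of W_star (data_loss and the 0.01 factor are placeholders):

# Penalize only the first M columns (the original W); the bias column
# W_star[:, -1] stays unregularized.
l2_penalty = tf.nn.l2_loss(W_star[:, :-1])
total_loss = data_loss + 0.01 * l2_penalty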

Upvotes: 1
