Temak

Reputation: 3009

Does Keras calculate gradients for frozen layers?

I use Keras with the TensorFlow backend.
Will Keras still calculate gradients for layers for which I set trainable = False?

I haven't observed a speedup for deep networks (like ResNet-50) when I freeze a substantial part of the layers. It looks like the gradients are still being calculated for the frozen layers, but their values are multiplied by 0. Can anyone tell me for sure whether this is the case?

Here is an example with a small network, where I fix the first layer.

import numpy as np
import keras

x = keras.layers.Input(shape=(5,))
y = keras.layers.Dense(5)(x)

z = keras.layers.Dense(5)(y)
model = keras.models.Model(x, z)
for layer in model.layers[:2]:
    layer.trainable = False

model.compile(optimizer='rmsprop', loss='mse')
model.summary()

X = np.random.rand(100, 5)

model.fit(X, X, epochs=100)
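
As a sanity check that trainable = False takes effect at all, here is a minimal sketch along the lines of the code above, verifying that the frozen layer's weights are left untouched by fit (the layer sizes and epoch count are just for illustration):

import numpy as np
import keras

# Same toy model as above, with the first Dense layer frozen.
x = keras.layers.Input(shape=(5,))
y = keras.layers.Dense(5)(x)
z = keras.layers.Dense(5)(y)
model = keras.models.Model(x, z)
model.layers[1].trainable = False   # model.layers[0] is the Input layer
model.compile(optimizer='rmsprop', loss='mse')

X = np.random.rand(100, 5)
before = [w.copy() for w in model.layers[1].get_weights()]
model.fit(X, X, epochs=10, verbose=0)
after = model.layers[1].get_weights()

# The frozen layer's kernel and bias should be exactly unchanged.
print(all(np.array_equal(a, b) for a, b in zip(before, after)))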

Upvotes: 4

Views: 922

Answers (1)

KT.

Reputation: 11440

If you look at the source code, you can see that gradients are only computed with respect to the model's trainable_weights.
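
You can see this split directly on the model object; a minimal sketch, rebuilding the two-layer model from the question (Keras 2.x with the TensorFlow backend assumed):

import keras

x = keras.layers.Input(shape=(5,))
y = keras.layers.Dense(5)(x)
z = keras.layers.Dense(5)(y)
model = keras.models.Model(x, z)
model.layers[1].trainable = False   # freeze the first Dense layer
model.compile(optimizer='rmsprop', loss='mse')

# Gradients are only taken with respect to trainable_weights;
# the frozen layer's kernel and bias end up in non_trainable_weights.
print([w.name for w in model.trainable_weights])
print([w.name for w in model.non_trainable_weights])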

Note, though, that to compute any gradient you need to do a full forward pass over the network anyway, and you then need to backpropagate all the way back to the first trainable layer as well. Consequently, the gains may indeed not be as large as you expect (it is not as if setting half of the weights to be non-trainable gives you a 2x speedup).

In your case, making the first layer non-trainable saves you just one matrix multiplication out of four (two in the forward pass, two in the backward pass). If I measure the runtime of your code with and without the first layer trainable, I see 1.4s vs 1.15s (TensorFlow CPU) or 13s vs 11s (Theano CPU, pure Python), which looks reasonable to me.

If you try a longer network (say, add 10 more layers to your example), the difference between having all layers trainable and only the last one trainable becomes something like 50s vs 10s according to my measurements (Theano, pure Python).

Note that you should normally not expect a performance gain of more than about 50%, since you essentially only save part of the backward pass. The hefty 5x win is most probably only possible thanks to Theano's optimizations, which combine all the non-trainable dense layers without activations into a single matrix multiplication. Indeed, on TensorFlow I only see a difference of 1.5s vs 2.0s here.
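
If you want to reproduce this kind of comparison yourself, here is a minimal sketch (the layer count, epochs and timing loop are illustrative, not the exact benchmark above):

import time
import numpy as np
import keras

def build_model(n_layers, freeze_all_but_last):
    x = keras.layers.Input(shape=(5,))
    h = x
    dense_layers = []
    for _ in range(n_layers):
        layer = keras.layers.Dense(5)
        h = layer(h)
        dense_layers.append(layer)
    if freeze_all_but_last:
        for layer in dense_layers[:-1]:
            layer.trainable = False
    model = keras.models.Model(x, h)
    model.compile(optimizer='rmsprop', loss='mse')
    return model

X = np.random.rand(100, 5)
for freeze in (False, True):
    model = build_model(12, freeze)
    model.fit(X, X, epochs=1, verbose=0)   # warm-up (graph construction etc.)
    start = time.time()
    model.fit(X, X, epochs=100, verbose=0)
    print('all but last frozen:', freeze, 'time:', round(time.time() - start, 2), 's')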

Upvotes: 7
