user9191983

Reputation: 595

Why aren't all the activation functions identical?

This is something I picked up from somewhere on the Internet. It is a very simple GAN+CNN modelling code, specifically the discriminator model, written in Keras with Python 3.6. It works pretty well, but there is something I don't understand.

from keras.models import Sequential
from keras.layers import Conv2D, LeakyReLU, Flatten, Dense, Dropout, Activation

def __init__(self):
    self.img_rows = 28
    self.img_cols = 28
    self.channels = 1

def build_discriminator(self):
    img_shape = (self.img_rows, self.img_cols, self.channels)

    model = Sequential()
    model.add(Conv2D(64, (5, 5), strides=(2, 2),
                     padding='same', input_shape=img_shape))
    model.add(LeakyReLU(0.2))
    model.add(Conv2D(128, (5, 5), strides=(2, 2)))
    model.add(LeakyReLU(0.2))
    model.add(Flatten())
    model.add(Dense(256))
    model.add(LeakyReLU(0.2))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    return model

There are several activation functions here, but why aren't they all identical? If the final output is 'sigmoid', shouldn't the rest be the same function as well? Why is LeakyReLU used in the middle layers? Thanks.

Upvotes: 1

Views: 87

Answers (2)

Dr. Snoopy

Reputation: 56397

The output and hidden layer activation functions do not have to be identical. Hidden layer activations are part of the mechanism that learns features, so it's important that they do not suffer from vanishing gradients (as sigmoid does), while the output layer activation is tied to the output task, for example a softmax activation for multi-class classification or a sigmoid for a binary decision such as real vs. fake here.
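A minimal sketch of that idea (the layer sizes, input shape and variable names are illustrative, not taken from the question): the hidden layers use LeakyReLU so gradients keep flowing, while the last layer's activation is chosen to match the task.

from keras.models import Sequential
from keras.layers import Dense, LeakyReLU

# Binary decision (real vs. fake), as in the discriminator above:
binary_model = Sequential()
binary_model.add(Dense(256, input_shape=(784,)))
binary_model.add(LeakyReLU(0.2))                     # hidden activation
binary_model.add(Dense(1, activation='sigmoid'))     # output: probability in [0, 1]

# Multi-class classification would instead end in softmax:
multiclass_model = Sequential()
multiclass_model.add(Dense(256, input_shape=(784,)))
multiclass_model.add(LeakyReLU(0.2))
multiclass_model.add(Dense(10, activation='softmax'))  # one probability per class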

Upvotes: 0

jottbe

Reputation: 4521

I guess they didn't use sigmoid for the rest of the layers because with sigmoid you have a big problem with vanishing gradients in deep networks. The reason is that the sigmoid function flattens out on both sides away from zero, so the layers further from the output tend to receive very small gradients and thus learn very slowly: loosely speaking, the gradient at those layers is a product of the gradients of all the layers above them, as a result of the chain rule of differentiation. So if you have just a few sigmoid layers you might get away with it, but as soon as you chain several of them, the gradients become unstable.
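To make that concrete, here is a rough numeric illustration (the input value and layer counts are arbitrary): the sigmoid's derivative is at most 0.25, so a product of such factors across many layers shrinks towards zero, while LeakyReLU's derivative is 1 for positive inputs, so the product does not collapse.

import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)               # at most 0.25 (reached at x = 0)

def leaky_relu_grad(x, alpha=0.2):
    return 1.0 if x > 0 else alpha     # 1 for positive inputs

x = 0.5  # arbitrary pre-activation value
for n_layers in (2, 5, 10):
    print(n_layers, "layers ->",
          "sigmoid factor:", sigmoid_grad(x) ** n_layers,
          "| leaky relu factor:", leaky_relu_grad(x) ** n_layers)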

It's too complex for me to explain in a post here, but if you want to know it in more detail, you can read about it in a chapter of an online book. By the way, that book is really great and worth reading beyond that chapter. To understand the chapter you will probably have to read chapter 1 of the book first, if you don't know how backpropagation works.

Upvotes: 2
