EhsanYaghoubi

Reputation: 145

Several FC layers in a row

I have a question about the role of the fully connected (FC) layers at the end of a CNN.

1- Does the FC layer act as a learned classifier?

2- Why do we first use a linear activation function followed by a non-linear one (e.g. softmax)?

3- What is the reason for adding several FC layers in a row on top of the network, like this?

# KL refers to keras.layers; M_L is the incoming feature tensor from the convolutional base
M_L = KL.Dense(512, activation='relu')(M_L)
M_L = KL.Dropout(DROPOUT_PROB)(M_L)
M_L = KL.Dense(256, activation='relu')(M_L)
M_L = KL.Dropout(DROPOUT_PROB)(M_L)
M_L = KL.Dense(128, activation='relu')(M_L)
M_L = KL.Dropout(DROPOUT_PROB)(M_L)
M_L = KL.Dense(64, activation='relu')(M_L)
M_L = KL.Dropout(DROPOUT_PROB)(M_L)
M_L = KL.Dense(1, activation='sigmoid')(M_L)

4- What would be the difference if we only did this:

M_L = KL.Dense(512, activation='relu')(M_L)
M_L = KL.Dropout(DROPOUT_PROB)(M_L)
M_L = KL.Dense(1, activation='sigmoid')(M_L)

Or even:

M_L = KL.Dense(1, activation='sigmoid')(M_L)

My intuition is that by adding more FC layers we get more trainable parameters, so in a multi-task network it helps to give each task some parameters of its own. Am I right?
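For example (a rough sketch with made-up layer sizes and task names, assuming KL = keras.layers and a shared feature vector), what I have in mind is something like:

from tensorflow import keras
KL = keras.layers

# Hypothetical shared feature vector coming out of the convolutional base
inputs = KL.Input(shape=(2048,))
shared = KL.Dense(512, activation='relu')(inputs)          # parameters shared by all tasks

# Task-specific FC branches: each task gets its own trainable parameters
task_a = KL.Dense(64, activation='relu')(shared)
task_a = KL.Dense(1, activation='sigmoid', name='task_a')(task_a)

task_b = KL.Dense(64, activation='relu')(shared)
task_b = KL.Dense(1, activation='sigmoid', name='task_b')(task_b)

model = keras.Model(inputs, [task_a, task_b])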

5- Is there any other reason for adding several consecutive FC layers? Does reducing the number of features gradually help when training a classifier?

Upvotes: 0

Views: 308

Answers (1)

John Ladasky

Reputation: 1064

The Universal Approximation Theorem states that a neural network needs only a single hidden layer with non-linear activation functions to approximate any (continuous) function. That single layer might need an infinite number of units to model the function perfectly -- but we can get an approximation of arbitrary accuracy by choosing a sufficiently large number of units.
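As a toy illustration of that idea (a throwaway sketch, not part of your model), a single, sufficiently wide hidden layer can fit a 1-D function like a sine wave:

import numpy as np
from tensorflow import keras

# Toy data: approximate sin(x) on [-pi, pi] with one wide hidden layer
x = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)
y = np.sin(x)

model = keras.Sequential([
    keras.Input(shape=(1,)),
    keras.layers.Dense(256, activation='relu'),   # the single hidden layer
    keras.layers.Dense(1)                         # linear output
])
model.compile(optimizer='adam', loss='mse')
model.fit(x, y, epochs=200, verbose=0)            # widen the layer / train longer for a closer fit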

So you are right: in principle, your second architecture will also be able to approximate your function. It won't do it as well as the first, but it will do something.

The third architecture is extremely weak, because you don't have a hidden layer at all. You only have a single unit with a sigmoid activation function. Presumably the function you want to model is constrained to the range 0 to 1. That's why there's a sigmoid output layer in all your architectures. Presumably, you have many inputs. All that will happen in your third architecture is that you will take a weighted, linear sum of your inputs, add one scalar (the bias), and then take the sigmoid of the result. That's not very expressive. You can't get arbitrarily close to an arbitrary function with this architecture.
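In other words, KL.Dense(1, activation='sigmoid') applied directly to the features is just logistic regression. A minimal sketch of what it computes (w and b stand for the weights and bias the layer would learn; the numbers are made up):

import numpy as np

def dense_1_sigmoid(x, w, b):
    """What Dense(1, activation='sigmoid') computes: a weighted linear sum
    of the inputs plus a bias, squashed through a sigmoid."""
    z = np.dot(x, w) + b                 # linear combination of the input features
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid -> a single value in (0, 1)

x = np.array([0.2, -1.3, 0.7])           # input features
w = np.array([0.5, 0.1, -0.4])           # learned weights
b = 0.05                                 # learned bias
print(dense_1_sigmoid(x, w, b))          # one probability-like score, no hidden representation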

Now, what's special about the first, "deep" architecture? The Universal Approximation Theorem says we only need one hidden layer, and the second architecture has that. So we could just make that single hidden layer wider, right? Well, the Universal Approximation Theorem doesn't say that a single hidden layer is the BEST way to model a function. Frequently, we find that multiple layers with progressively smaller numbers of units produce better results. To achieve results with the second architecture comparable to the first, you might need 10,000 units in your hidden layer.
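To put rough numbers on that (assuming, just for illustration, a 2048-dimensional feature vector feeding the head; your convolutional base determines the real size), you can compare the trainable parameter counts of the two heads:

from tensorflow import keras
KL = keras.layers

def deep_head(x):
    # Your first architecture (Dropout omitted: it adds no trainable parameters)
    for units in (512, 256, 128, 64):
        x = KL.Dense(units, activation='relu')(x)
    return KL.Dense(1, activation='sigmoid')(x)

def wide_head(x, width):
    # A single wide hidden layer, as in the second architecture
    x = KL.Dense(width, activation='relu')(x)
    return KL.Dense(1, activation='sigmoid')(x)

inp = KL.Input(shape=(2048,))                                   # assumed feature size
print(keras.Model(inp, deep_head(inp)).count_params())          # ~1.2 million parameters
print(keras.Model(inp, wide_head(inp, 10000)).count_params())   # ~20.5 million parameters

The deep, tapering head packs comparable capacity into far fewer parameters than a single very wide hidden layer.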

Before the introduction of ReLU, deep architectures trained very slowly or got stuck, largely because saturating activations like sigmoid and tanh shrink the gradients as they propagate back through many layers (the "vanishing gradient" problem). That's not much of an issue now.
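A quick back-of-the-envelope illustration of why that happened (again just a sketch, not part of your model):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# The sigmoid's slope is at most 0.25 (at z = 0); ReLU's slope is 1 for positive inputs.
sig_grad = sigmoid(0.0) * (1.0 - sigmoid(0.0))   # 0.25
print(sig_grad ** 10)                            # ~1e-6: the gradient all but vanishes after 10 sigmoid layers
print(1.0 ** 10)                                 # 1.0: ReLU preserves the gradient's magnitude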

Upvotes: 1
