Reputation: 145
My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.
I know what I have to do if I have a single hidden layer. In the case of a single hidden layer, if my input data X_train has n samples and d features (i.e. X_train is an (n, d) matrix and y_train is an (n, 1) vector), and if I have h1 hidden units in my first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1, where w_h1 is a weight matrix with random entries of shape (d, h1) and b_h1 is a bias unit of shape (h1, 1). I apply the sigmoid activation A_h1 = sigmoid(Z_h1) and find that both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and a bias b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out. From here I can get A_out = sigmoid(Z_out), which has shape (n, t). If I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer, then what steps must I add to the forward propagation and which steps should I modify?
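For concreteness, here is a minimal NumPy sketch of the single-hidden-layer forward pass described above. The sizes, the random initialisation, and the sigmoid helper are all illustrative, and the biases are stored as row vectors (shapes (1, h1) and (1, t)) so that they broadcast over the n samples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, d features, h1 hidden units, t output units
n, d, h1, t = 100, 5, 8, 3
X_train = np.random.randn(n, d)

w_h1 = np.random.randn(d, h1)        # weight matrix, shape (d, h1)
b_h1 = np.zeros((1, h1))             # bias as a row vector, shape (1, h1)
w_out = np.random.randn(h1, t)       # weight matrix, shape (h1, t)
b_out = np.zeros((1, t))             # bias, shape (1, t)

Z_h1 = X_train @ w_h1 + b_h1         # (n, h1)
A_h1 = sigmoid(Z_h1)                 # (n, h1)
Z_out = A_h1 @ w_out + b_out         # (n, t)
A_out = sigmoid(Z_out)               # (n, t)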
I also have an idea about how to tackle backpropagation in the case of a single hidden layer. For the single-hidden-layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1) I should do Step1_BP1: Err_out = A_out - y_train_onehot (here y_train_onehot is the one-hot representation of y_train, and Err_out has shape (n, t)). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out). The symbol (.)^T denotes the transpose of a matrix. For the second backpropagation step (hidden layer 1 -> input layer), we do the following. Step1_BP2: sig_deriv_h1 = (A_h1) * (1 - A_h1); here sig_deriv_h1 has shape (n, h1). In the next step, I do Step2_BP2: Err_h1 = (Err_out * (w_out)^T) multiplied element-wise by sig_deriv_h1; here Err_h1 has shape (n, h1). In the final step, I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (h2 units) after the 1st hidden layer and before the output layer? Should I modify the backpropagation steps for the one-hidden-layer case that I have described here?
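And a matching sketch of the single-hidden-layer backpropagation steps just described, continuing the variables from the snippet above (the integer labels y_train, their one-hot encoding, and the learning rate eta are illustrative):

# One-hot encode the labels (illustrative integer class labels)
y_train = np.random.randint(0, t, size=n)
y_train_onehot = np.eye(t)[y_train]                   # (n, t)
eta = 0.1                                             # learning rate

# First backpropagation step: output layer -> hidden layer 1
Err_out = A_out - y_train_onehot                      # (n, t)
delta_w_out = A_h1.T @ Err_out                        # (h1, t)
delta_b_out = Err_out.sum(axis=0, keepdims=True)      # (1, t)

# Second backpropagation step: hidden layer 1 -> input layer
sig_deriv_h1 = A_h1 * (1 - A_h1)                      # (n, h1)
Err_h1 = (Err_out @ w_out.T) * sig_deriv_h1           # (n, h1), element-wise product
delta_w_h1 = X_train.T @ Err_h1                       # (d, h1)
delta_b_h1 = Err_h1.sum(axis=0, keepdims=True)        # (1, h1)

# Gradient-descent updates
w_out -= eta * delta_w_out
b_out -= eta * delta_b_out
w_h1 -= eta * delta_w_h1
b_h1 -= eta * delta_b_h1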
Upvotes: 4
Views: 3085
Reputation: 3043
● Let X be a matrix of samples with shape (n, d), where n denotes the number of samples and d denotes the number of features.
● Let wh1 be the matrix of weights, of shape (d, h1), and
● let bh1 be the bias vector, of shape (1, h1).
You need the following steps for forward and backward propagations:
► FORWARD PROPAGATION:
⛶ Step 1:
Zh1 = [ X • wh1 ] + bh1
shapes: (n,d) • (d,h1) + (1,h1) → (n,h1)
Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.
⛶ Step 2:
Let Φ() be the activation function. We get:
ah1 = Φ(Zh1)
Both ah1 and Zh1 have shape (n,h1).
⛶ Step 3:
Obtain new weights and biases:
● wh2 of shape (h1, h2), and
● bh2 of shape (1, h2).
⛶ Step 4:
Zh2 = [ ah1 • wh2 ] + bh2
shapes: (n,h1) • (h1,h2) + (1,h2) → (n,h2)
Here, h2 is the number of hidden units in the second hidden layer.
⛶ Step 5:
ah2 = Φ(Zh2)
Both ah2 and Zh2 have shape (n,h2).
⛶ Step 6:
Obtain new weights and biases:
● wout of shape (h2, t), and
● bout of shape (1, t).
Here, t is the number of classes.
⛶ Step 7:
Zout = [ ah2 • wout ] + bout
shapes: (n,h2) • (h2,t) + (1,t) → (n,t)
⛶ Step 8:
aout = Φ(Zout)
Both aout and Zout have shape (n,t).
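Putting Steps 1-8 together, a minimal NumPy sketch of this forward pass might look as follows (the sizes and the random initialisation are illustrative; Steps 3 and 6 correspond to the weight/bias definitions):

import numpy as np

def phi(z):                                   # sigmoid activation Φ
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, d features, h1/h2 hidden units, t classes
n, d, h1, h2, t = 100, 5, 8, 6, 3
X = np.random.randn(n, d)

w_h1, b_h1 = np.random.randn(d, h1), np.zeros((1, h1))      # first hidden layer
w_h2, b_h2 = np.random.randn(h1, h2), np.zeros((1, h2))     # Step 3
w_out, b_out = np.random.randn(h2, t), np.zeros((1, t))     # Step 6

Z_h1 = X @ w_h1 + b_h1        # Step 1: (n, h1)
a_h1 = phi(Z_h1)              # Step 2: (n, h1)
Z_h2 = a_h1 @ w_h2 + b_h2     # Step 4: (n, h2)
a_h2 = phi(Z_h2)              # Step 5: (n, h2)
Z_out = a_h2 @ w_out + b_out  # Step 7: (n, t)
a_out = phi(Z_out)            # Step 8: (n, t)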
► BACKWARD PROPAGATION:
⛶ Step 1:
Construct the one-hot encoded matrix of the unique output classes (yone-hot).
Errorout = aout - yone-hot
shapes: (n,t) - (n,t) → (n,t)
⛶ Step 2:
Δwout = η ( ah2T • Errorout )
shapes: (h2,n) • (n,t) → (h2,t)
Δbout = η [ ∑i=1..n Errorout,i ]   (column-wise sum over the n samples)
shape: (1,t)
Here η is the learning rate.
wout = wout - Δwout (weight update.)
bout = bout - Δbout (bias update.)
⛶ Step 3:
Error2 = [ Errorout • woutT ] ✴ Φ′(ah2)
shapes: (n,t) • (t,h2) ✴ (n,h2) → (n,h2)
Here, the symbol ✴ denotes element-wise matrix multiplication, and Φ′ denotes the derivative of the sigmoid function.
⛶ Step 4:
Δwh2 = η ( ah1T • Error2 )
shapes: (h1,n) • (n,h2) → (h1,h2)
Δbh2 = η [ ∑i=1..n Error2,i ]
shape: (1,h2)
wh2 = wh2 - Δwh2 (weight update.)
bh2 = bh2 - Δbh2 (bias update.)
⛶ Step 5:
Error3 = [ Error2 • wh2T ] ✴ Φ′(ah1)
shapes: (n,h2) • (h2,h1) ✴ (n,h1) → (n,h1)
⛶ Step 6:
Δwh1 = η ( XT • Error3 )
shapes: (d,n) • (n,h1) → (d,h1)
Δbh1 = η [ ∑i=1..n Error3,i ]
shape: (1,h1)
wh1 = wh1 - Δwh1 (weight update.)
bh1 = bh1 - Δbh1 (bias update.)
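A matching sketch of the backward pass, continuing the forward-pass snippet above (the labels, their one-hot encoding, and the learning rate η are illustrative; the weight and bias updates are applied at the end, so every error term is computed with the pre-update weights):

# Illustrative labels and learning rate
y = np.random.randint(0, t, size=n)
y_onehot = np.eye(t)[y]                                # (n, t)
eta = 0.1

# Step 1: output error
Error_out = a_out - y_onehot                           # (n, t)

# Step 2: output-layer gradients
dw_out = eta * (a_h2.T @ Error_out)                    # (h2, t)
db_out = eta * Error_out.sum(axis=0, keepdims=True)    # (1, t)

# Step 3: error at hidden layer 2; for the sigmoid, Φ′(a) = a * (1 - a)
Error2 = (Error_out @ w_out.T) * (a_h2 * (1 - a_h2))   # (n, h2)

# Step 4: hidden-layer-2 gradients
dw_h2 = eta * (a_h1.T @ Error2)                        # (h1, h2)
db_h2 = eta * Error2.sum(axis=0, keepdims=True)        # (1, h2)

# Step 5: error at hidden layer 1
Error3 = (Error2 @ w_h2.T) * (a_h1 * (1 - a_h1))       # (n, h1)

# Step 6: hidden-layer-1 gradients
dw_h1 = eta * (X.T @ Error3)                           # (d, h1)
db_h1 = eta * Error3.sum(axis=0, keepdims=True)        # (1, h1)

# Weight and bias updates (applied after all error terms are computed)
w_out -= dw_out;  b_out -= db_out
w_h2  -= dw_h2;   b_h2  -= db_h2
w_h1  -= dw_h1;   b_h1  -= db_h1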
Upvotes: 7
Reputation: 1841
For forward propagation, the dimensions of the output from the first hidden layer must be compatible with the dimensions of the second hidden layer.
As mentioned above, your input has dimension (n, d). The output from hidden layer 1 will have dimension (n, h1). So the weights and bias for the second hidden layer must have shapes (h1, h2) and (1, h2) respectively: w_h2 will be of dimension (h1, h2) and b_h2 will be (1, h2).
For the output layer, w_output will be of dimension (h2, t) and b_output will be (1, t), where t is the number of output units.
The same dimension bookkeeping has to be repeated in backpropagation.
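As a quick sanity check of this dimension bookkeeping, a sketch with made-up sizes (activations are omitted since they do not change the shapes):

import numpy as np

n, d, h1, h2, t = 100, 5, 8, 6, 3
X = np.zeros((n, d))
w_h1, b_h1 = np.zeros((d, h1)), np.zeros((1, h1))
w_h2, b_h2 = np.zeros((h1, h2)), np.zeros((1, h2))
w_output, b_output = np.zeros((h2, t)), np.zeros((1, t))

A1 = X @ w_h1 + b_h1             # (n, h1) - output of hidden layer 1
A2 = A1 @ w_h2 + b_h2            # (n, h2) - output of hidden layer 2
out = A2 @ w_output + b_output   # (n, t)  - network output
assert A1.shape == (n, h1) and A2.shape == (n, h2) and out.shape == (n, t)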
Upvotes: 1