user10853036

Reputation: 145

Back-propagation and forward-propagation for 2 hidden layers in neural network

My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.

I know what to do if I have a single hidden layer. In the case of a single hidden layer, if my input data X_train has n samples and d features (i.e. X_train is an (n, d) matrix and y_train is an (n, 1) vector), and if I have h1 hidden units in my first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1, where w_h1 is a randomly initialized weight matrix of shape (d, h1) and b_h1 is a bias vector of shape (h1, 1). I use the sigmoid activation A_h1 = sigmoid(Z_h1) and find that both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and a bias b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out. From here I can get A_out = sigmoid(Z_out), which has shape (n, t). If I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer, then what steps must I add to the forward propagation, and which steps should I modify?
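For reference, a minimal NumPy sketch of the single-hidden-layer forward pass just described; the sizes are illustrative, and the biases are kept as row vectors of shape (1, h1) and (1, t) so that they broadcast over the n samples:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: n samples, d features, h1 hidden units, t output units.
    n, d, h1, t = 100, 20, 50, 3
    X_train = np.random.randn(n, d)

    w_h1 = np.random.randn(d, h1) * 0.01    # (d, h1)
    b_h1 = np.zeros((1, h1))                # (1, h1), broadcasts over the n rows
    w_out = np.random.randn(h1, t) * 0.01   # (h1, t)
    b_out = np.zeros((1, t))                # (1, t)

    Z_h1 = X_train @ w_h1 + b_h1    # (n, h1)
    A_h1 = sigmoid(Z_h1)            # (n, h1)
    Z_out = A_h1 @ w_out + b_out    # (n, t)
    A_out = sigmoid(Z_out)          # (n, t)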

I also have an idea of how to tackle backpropagation in the case of a single-hidden-layer network. For the single-hidden-layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1) I should do Step1_BP1: Err_out = A_out - y_train_onehot (where y_train_onehot is the one-hot representation of y_train; Err_out has shape (n, t)). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out), where (.)^T denotes the matrix transpose. For the second backpropagation step (hidden layer 1 -> input layer), I do Step1_BP2: sig_deriv_h1 = A_h1 * (1 - A_h1), where sig_deriv_h1 has shape (n, h1). In the next step, I do Step2_BP2: Err_h1 = (Err_out * w_out^T) multiplied element-wise by sig_deriv_h1, i.e. Err_h1_{i,j} = (Err_out * w_out^T)_{i,j} * sig_deriv_h1_{i,j}; Err_h1 has shape (n, h1). In the final step, I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer? Should I modify the backpropagation steps for the one-hidden-layer case described here?
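Continuing the forward-pass sketch above (same variables), the single-hidden-layer backward pass described in this paragraph looks roughly like this; eta is a hypothetical learning rate and y_train_onehot is a made-up (n, t) one-hot label matrix:

    # Continuing the forward-pass sketch above (same variables).
    eta = 0.1                                                     # hypothetical learning rate
    y_train_onehot = np.eye(t)[np.random.randint(0, t, size=n)]  # (n, t), made-up labels

    # Output layer -> hidden layer 1 (Step1_BP1, Step2_BP1)
    Err_out = A_out - y_train_onehot                  # (n, t)
    delta_w_out = A_h1.T @ Err_out                    # (h1, t)
    delta_b_out = Err_out.sum(axis=0, keepdims=True)  # (1, t)

    # Hidden layer 1 -> input layer (Step1_BP2 .. Step3_BP2)
    sig_deriv_h1 = A_h1 * (1.0 - A_h1)                # (n, h1)
    Err_h1 = (Err_out @ w_out.T) * sig_deriv_h1       # (n, h1), element-wise product
    delta_w_h1 = X_train.T @ Err_h1                   # (d, h1)
    delta_b_h1 = Err_h1.sum(axis=0, keepdims=True)    # (1, h1)

    # Gradient-descent updates
    w_out -= eta * delta_w_out
    b_out -= eta * delta_b_out
    w_h1 -= eta * delta_w_h1
    b_h1 -= eta * delta_b_h1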

Upvotes: 4

Views: 3085

Answers (2)

Siddharth Satpathy

Reputation: 3043

● Let X be the matrix of samples, with shape (n, d), where n denotes the number of samples and d denotes the number of features.

● Let w_h1 be the weight matrix, of shape (d, h1), and

● let b_h1 be the bias vector, of shape (1, h1).

You need the following steps for forward and backward propagation:

FORWARD PROPAGATION:

Step 1:

Z_h1 = [ X • w_h1 ] + b_h1

shapes: (n,h1) = (n,d) • (d,h1) + (1,h1)

Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.

Step 2:

Let Φ(·) be the activation function. We get:

a_h1 = Φ(Z_h1)

Both a_h1 and Z_h1 have shape (n, h1).

Step 3:

Obtain new weights and biases:

w_h2 of shape (h1, h2), and

b_h2 of shape (1, h2).

Step 4:

Z_h2 = [ a_h1 • w_h2 ] + b_h2

shapes: (n,h2) = (n,h1) • (h1,h2) + (1,h2)

Here, h2 is the number of hidden units in the second hidden layer.

Step 5:

a_h2 = Φ(Z_h2)

Both a_h2 and Z_h2 have shape (n, h2).

Step 6:

Obtain new weights and biases:

w_out of shape (h2, t), and

b_out of shape (1, t).

Here, t is the number of classes.

Step 7:

Z_out = [ a_h2 • w_out ] + b_out

shapes: (n,t) = (n,h2) • (h2,t) + (1,t)

Step 8:

a_out = Φ(Z_out)

Both a_out and Z_out have shape (n, t).
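A minimal NumPy sketch of these eight forward-propagation steps (the layer sizes and weight initialization are illustrative; Steps 3 and 6 correspond to creating the new weight and bias arrays):

    import numpy as np

    def phi(z):                       # sigmoid activation Φ
        return 1.0 / (1.0 + np.exp(-z))

    # Illustrative sizes: n samples, d features, h1 and h2 hidden units, t classes.
    n, d, h1, h2, t = 100, 20, 64, 32, 3
    X = np.random.randn(n, d)

    w_h1, b_h1 = np.random.randn(d, h1) * 0.01, np.zeros((1, h1))    # layer-1 parameters
    w_h2, b_h2 = np.random.randn(h1, h2) * 0.01, np.zeros((1, h2))   # Step 3
    w_out, b_out = np.random.randn(h2, t) * 0.01, np.zeros((1, t))   # Step 6

    Z_h1 = X @ w_h1 + b_h1        # Step 1: (n, h1)
    a_h1 = phi(Z_h1)              # Step 2: (n, h1)
    Z_h2 = a_h1 @ w_h2 + b_h2     # Step 4: (n, h2)
    a_h2 = phi(Z_h2)              # Step 5: (n, h2)
    Z_out = a_h2 @ w_out + b_out  # Step 7: (n, t)
    a_out = phi(Z_out)            # Step 8: (n, t)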

BACKWARD PROPAGATION:

Step 1:

Construct the one-hot encoded matrix of the output classes (y_onehot).

Error_out = a_out - y_onehot

shapes: (n,t) = (n,t) - (n,t)

Step 2:

Δw_out = η ( a_h2^T • Error_out )

shapes: (h2,t) = (h2,n) • (n,t)

Δb_out = η [ Σ_{i=1..n} Error_out[i, :] ]

shape: (1,t)   (the sum runs over the n rows of Error_out)

Here η is the learning rate.

w_out = w_out - Δw_out         (weight update)

b_out = b_out - Δb_out         (bias update)

Step 3:

Error_2 = [ Error_out • w_out^T ] ✴ Φ′(a_h2)

shapes: (n,h2) = (n,t) • (t,h2) ✴ (n,h2)

Here, the symbol ✴ denotes element-wise matrix multiplication, and Φ′ denotes the derivative of the sigmoid function; for the sigmoid, Φ′(a_h2) = a_h2 ✴ (1 − a_h2).

Step 4:

Δw_h2 = η ( a_h1^T • Error_2 )

shapes: (h1,h2) = (h1,n) • (n,h2)

Δb_h2 = η [ Σ_{i=1..n} Error_2[i, :] ]

shape: (1,h2)

w_h2 = w_h2 - Δw_h2         (weight update)

b_h2 = b_h2 - Δb_h2         (bias update)

Step 5:

Error_3 = [ Error_2 • w_h2^T ] ✴ Φ′(a_h1)

shapes: (n,h1) = (n,h2) • (h2,h1) ✴ (n,h1)

Step 6:

Δw_h1 = η ( X^T • Error_3 )

shapes: (d,h1) = (d,n) • (n,h1)

Δb_h1 = η [ Σ_{i=1..n} Error_3[i, :] ]

shape: (1,h1)

w_h1 = w_h1 - Δw_h1         (weight update)

b_h1 = b_h1 - Δb_h1         (bias update)
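Continuing the forward-pass sketch above (same variables), the six backward-propagation steps look roughly like this; η and the one-hot labels are illustrative, all gradients are computed with the pre-update weights, and the parameters are updated together at the end:

    # Continuing the forward-pass sketch above (same variables).
    eta = 0.1                                                # illustrative learning rate η
    y_onehot = np.eye(t)[np.random.randint(0, t, size=n)]    # (n, t), made-up labels

    Error_out = a_out - y_onehot                             # Step 1: (n, t)

    dw_out = eta * (a_h2.T @ Error_out)                      # Step 2: (h2, t)
    db_out = eta * Error_out.sum(axis=0, keepdims=True)      #         (1, t)

    Error_2 = (Error_out @ w_out.T) * (a_h2 * (1 - a_h2))    # Step 3: (n, h2)

    dw_h2 = eta * (a_h1.T @ Error_2)                         # Step 4: (h1, h2)
    db_h2 = eta * Error_2.sum(axis=0, keepdims=True)         #         (1, h2)

    Error_3 = (Error_2 @ w_h2.T) * (a_h1 * (1 - a_h1))       # Step 5: (n, h1)

    dw_h1 = eta * (X.T @ Error_3)                            # Step 6: (d, h1)
    db_h1 = eta * Error_3.sum(axis=0, keepdims=True)         #         (1, h1)

    # Weight and bias updates
    w_out -= dw_out; b_out -= db_out
    w_h2 -= dw_h2;   b_h2 -= db_h2
    w_h1 -= dw_h1;   b_h1 -= db_h1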

Upvotes: 7

bumblebee

Reputation: 1841

For forward propagation, the dimensions of the output of the first hidden layer must be compatible with the input dimensions of the second hidden layer.

As mentioned above, your input has shape (n, d), and the output from hidden layer 1 has shape (n, h1). So the weights for the second hidden layer must have shape (h1, h2), and its bias must have shape (1, h2).

That is, w_h2 will have shape (h1, h2) and b_h2 will have shape (1, h2).

For the output layer (with a single output unit), w_output will have shape (h2, 1) and b_output will have shape (1, 1).

The same dimension checks apply in backpropagation.
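As an illustrative sanity check of how the shapes chain together (assuming a single output unit and row-vector biases, with made-up sizes):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    n, d, h1, h2 = 100, 20, 64, 32             # illustrative sizes
    X = np.random.randn(n, d)
    w_h1, b_h1 = np.random.randn(d, h1), np.zeros((1, h1))
    w_h2, b_h2 = np.random.randn(h1, h2), np.zeros((1, h2))
    w_output, b_output = np.random.randn(h2, 1), np.zeros((1, 1))

    a1 = sigmoid(X @ w_h1 + b_h1)              # (n, h1)
    a2 = sigmoid(a1 @ w_h2 + b_h2)             # (n, h2)
    out = sigmoid(a2 @ w_output + b_output)    # (n, 1)
    assert a1.shape == (n, h1) and a2.shape == (n, h2) and out.shape == (n, 1)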

Upvotes: 1
