Reputation: 145
My question is about forward and backward propagation for deep neural networks when the number of hidden layers is greater than 1.
I know what I have to do if I have a single hidden layer. In the case of a single hidden layer, if my input data X_train has n samples and d features (i.e. X_train is an (n, d) matrix and y_train is an (n, 1) vector), and if I have h1 hidden units in my first hidden layer, then I compute Z_h1 = (X_train * w_h1) + b_h1, where w_h1 is a weight matrix with random entries of shape (d, h1) and b_h1 is a bias unit of shape (h1, 1). I apply the sigmoid activation A_h1 = sigmoid(Z_h1) and find that both A_h1 and Z_h1 have shape (n, h1). If I have t output units, then I use a weight matrix w_out of shape (h1, t) and a bias b_out of shape (t, 1) to get the output Z_out = (A_h1 * w_out) + b_out. From here I can get A_out = sigmoid(Z_out), which has shape (n, t). If I have a 2nd hidden layer (with h2 units) after the 1st hidden layer and before the output layer, then what steps must I add to the forward propagation and which steps should I modify?
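For concreteness, here is a minimal NumPy sketch of the single-hidden-layer forward pass described above. The sizes, the random initialisation, and the sigmoid helper are all illustrative, and the biases are stored as row vectors (shapes (1, h1) and (1, t)) so that they broadcast over the n samples:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, d features, h1 hidden units, t output units
n, d, h1, t = 100, 5, 8, 3
X_train = np.random.randn(n, d)

w_h1 = np.random.randn(d, h1)        # weight matrix, shape (d, h1)
b_h1 = np.zeros((1, h1))             # bias as a row vector, shape (1, h1)
w_out = np.random.randn(h1, t)       # weight matrix, shape (h1, t)
b_out = np.zeros((1, t))             # bias, shape (1, t)

Z_h1 = X_train @ w_h1 + b_h1         # (n, h1)
A_h1 = sigmoid(Z_h1)                 # (n, h1)
Z_out = A_h1 @ w_out + b_out         # (n, t)
A_out = sigmoid(Z_out)               # (n, t)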
I also have an idea about how to tackle backpropagation in the case of a single hidden layer. For the single-hidden-layer example in the previous paragraph, I know that in the first backpropagation step (output layer -> hidden layer 1) I should do Step1_BP1: Err_out = A_out - y_train_onehot (here y_train_onehot is the one-hot representation of y_train, and Err_out has shape (n, t)). This is followed by Step2_BP1: delta_w_out = (A_h1)^T * Err_out and delta_b_out = sum(Err_out). The symbol (.)^T denotes the transpose of a matrix. For the second backpropagation step (hidden layer 1 -> input layer), we do the following. Step1_BP2: sig_deriv_h1 = (A_h1) * (1 - A_h1); here sig_deriv_h1 has shape (n, h1). In the next step, I do Step2_BP2: Err_h1 = (Err_out * (w_out)^T) multiplied element-wise by sig_deriv_h1; here Err_h1 has shape (n, h1). In the final step, I do Step3_BP2: delta_w_h1 = (X_train)^T * Err_h1 and delta_b_h1 = sum(Err_h1). What backpropagation steps should I add if I have a 2nd hidden layer (h2 units) after the 1st hidden layer and before the output layer? Should I modify the backpropagation steps for the one-hidden-layer case that I have described here?
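And a matching sketch of the single-hidden-layer backpropagation steps just described, continuing the variables from the snippet above (the integer labels y_train, their one-hot encoding, and the learning rate eta are illustrative):

# One-hot encode the labels (illustrative integer class labels)
y_train = np.random.randint(0, t, size=n)
y_train_onehot = np.eye(t)[y_train]                   # (n, t)
eta = 0.1                                             # learning rate

# First backpropagation step: output layer -> hidden layer 1
Err_out = A_out - y_train_onehot                      # (n, t)
delta_w_out = A_h1.T @ Err_out                        # (h1, t)
delta_b_out = Err_out.sum(axis=0, keepdims=True)      # (1, t)

# Second backpropagation step: hidden layer 1 -> input layer
sig_deriv_h1 = A_h1 * (1 - A_h1)                      # (n, h1)
Err_h1 = (Err_out @ w_out.T) * sig_deriv_h1           # (n, h1), element-wise product
delta_w_h1 = X_train.T @ Err_h1                       # (d, h1)
delta_b_h1 = Err_h1.sum(axis=0, keepdims=True)        # (1, h1)

# Gradient-descent updates
w_out -= eta * delta_w_out
b_out -= eta * delta_b_out
w_h1 -= eta * delta_w_h1
b_h1 -= eta * delta_b_h1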
Upvotes: 4
Views: 3085
Reputation: 3043
● Let X be a matrix of samples with shape (n, d), where n denotes the number of samples and d denotes the number of features.
● Let wh1 be the matrix of weights, of shape (d, h1), and
● let bh1 be the bias vector, of shape (1, h1).
You need the following steps for forward and backward propagations:
► FORWARD PROPAGATION:
⛶ Step 1:
Zh1 = [ X • wh1 ] + bh1
shapes: (n,d) • (d,h1) + (1,h1) → (n,h1)
Here, the symbol • represents matrix multiplication, and h1 denotes the number of hidden units in the first hidden layer.
⛶ Step 2:
Let Φ() be the activation function. We get:
ah1 = Φ(Zh1)
Both ah1 and Zh1 have shape (n,h1).
⛶ Step 3:
Obtain new weights and biases:
● wh2 of shape (h1, h2), and
● bh2 of shape (1, h2).
⛶ Step 4:
Zh2 = [ ah1 • wh2 ] + bh2
shapes: (n,h1) • (h1,h2) + (1,h2) → (n,h2)
Here, h2 is the number of hidden units in the second hidden layer.
⛶ Step 5:
ah2 = Φ(Zh2)
Both ah2 and Zh2 have shape (n,h2).
⛶ Step 6:
Obtain new weights and biases:
● wout of shape (h2, t), and
● bout of shape (1, t).
Here, t is the number of classes.
⛶ Step 7:
Zout = [ ah2 • wout ] + bout
shapes: (n,h2) • (h2,t) + (1,t) → (n,t)
⛶ Step 8:
aout = Φ(Zout)
Both aout and Zout have shape (n,t).
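Putting Steps 1-8 together, a minimal NumPy sketch of this forward pass might look as follows (the sizes and the random initialisation are illustrative; Steps 3 and 6 correspond to the weight/bias definitions):

import numpy as np

def phi(z):                                   # sigmoid activation Φ
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative sizes: n samples, d features, h1/h2 hidden units, t classes
n, d, h1, h2, t = 100, 5, 8, 6, 3
X = np.random.randn(n, d)

w_h1, b_h1 = np.random.randn(d, h1), np.zeros((1, h1))      # first hidden layer
w_h2, b_h2 = np.random.randn(h1, h2), np.zeros((1, h2))     # Step 3
w_out, b_out = np.random.randn(h2, t), np.zeros((1, t))     # Step 6

Z_h1 = X @ w_h1 + b_h1        # Step 1: (n, h1)
a_h1 = phi(Z_h1)              # Step 2: (n, h1)
Z_h2 = a_h1 @ w_h2 + b_h2     # Step 4: (n, h2)
a_h2 = phi(Z_h2)              # Step 5: (n, h2)
Z_out = a_h2 @ w_out + b_out  # Step 7: (n, t)
a_out = phi(Z_out)            # Step 8: (n, t)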
► BACKWARD PROPAGATION:
⛶ Step 1:
Construct the one-hot encoded matrix of the unique output classes (yone-hot).
Errorout = aout - yone-hot
shapes: (n,t) - (n,t) → (n,t)
⛶ Step 2:
Δwout = η ( ah2T • Errorout )
shapes: (h2,n) • (n,t) → (h2,t)
Δbout = η [ ∑i=1..n Errorout,i ]   (column-wise sum over the n samples)
shape: (1,t)
Here η is the learning rate.
wout = wout - Δwout (weight update.)
bout = bout - Δbout (bias update.)
⛶ Step 3:
Error2 = [ Errorout • woutT ] ✴ Φ′(ah2)
shapes: (n,t) • (t,h2) ✴ (n,h2) → (n,h2)
Here, the symbol ✴ denotes element-wise matrix multiplication, and Φ′ denotes the derivative of the sigmoid function.
⛶ Step 4:
Δwh2 = η ( ah1T • Error2 )
shapes: (h1,n) • (n,h2) → (h1,h2)
Δbh2 = η [ ∑i=1..n Error2,i ]
shape: (1,h2)
wh2 = wh2 - Δwh2 (weight update.)
bh2 = bh2 - Δbh2 (bias update.)
⛶ Step 5:
Error3 = [ Error2 • wh2T ] ✴ Φ′(ah1)
shapes: (n,h2) • (h2,h1) ✴ (n,h1) → (n,h1)
⛶ Step 6:
Δwh1 = η ( XT • Error3 )
shapes: (d,n) • (n,h1) → (d,h1)
Δbh1 = η [ ∑i=1..n Error3,i ]
shape: (1,h1)
wh1 = wh1 - Δwh1 (weight update.)
bh1 = bh1 - Δbh1 (bias update.)
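A matching sketch of the backward pass, continuing the forward-pass snippet above (the labels, their one-hot encoding, and the learning rate η are illustrative; the weight and bias updates are applied at the end, so every error term is computed with the pre-update weights):

# Illustrative labels and learning rate
y = np.random.randint(0, t, size=n)
y_onehot = np.eye(t)[y]                                # (n, t)
eta = 0.1

# Step 1: output error
Error_out = a_out - y_onehot                           # (n, t)

# Step 2: output-layer gradients
dw_out = eta * (a_h2.T @ Error_out)                    # (h2, t)
db_out = eta * Error_out.sum(axis=0, keepdims=True)    # (1, t)

# Step 3: error at hidden layer 2; for the sigmoid, Φ′(a) = a * (1 - a)
Error2 = (Error_out @ w_out.T) * (a_h2 * (1 - a_h2))   # (n, h2)

# Step 4: hidden-layer-2 gradients
dw_h2 = eta * (a_h1.T @ Error2)                        # (h1, h2)
db_h2 = eta * Error2.sum(axis=0, keepdims=True)        # (1, h2)

# Step 5: error at hidden layer 1
Error3 = (Error2 @ w_h2.T) * (a_h1 * (1 - a_h1))       # (n, h1)

# Step 6: hidden-layer-1 gradients
dw_h1 = eta * (X.T @ Error3)                           # (d, h1)
db_h1 = eta * Error3.sum(axis=0, keepdims=True)        # (1, h1)

# Weight and bias updates (applied after all error terms are computed)
w_out -= dw_out;  b_out -= db_out
w_h2  -= dw_h2;   b_h2  -= db_h2
w_h1  -= dw_h1;   b_h1  -= db_h1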
Upvotes: 7
Reputation: 1841
For forward propagation, the dimensions of the output from the first hidden layer must be compatible with the dimensions of the second hidden layer.
As mentioned above, your input has dimension (n, d). The output from hidden layer 1 will have dimension (n, h1). So the weights and bias for the second hidden layer must have shapes (h1, h2) and (1, h2) respectively: w_h2 will be of dimension (h1, h2) and b_h2 will be (1, h2).
For the output layer, w_output will be of dimension (h2, t) and b_output will be (1, t), where t is the number of output units.
The same dimension bookkeeping has to be repeated in backpropagation.
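As a quick sanity check of this dimension bookkeeping, a sketch with made-up sizes (activations are omitted since they do not change the shapes):

import numpy as np

n, d, h1, h2, t = 100, 5, 8, 6, 3
X = np.zeros((n, d))
w_h1, b_h1 = np.zeros((d, h1)), np.zeros((1, h1))
w_h2, b_h2 = np.zeros((h1, h2)), np.zeros((1, h2))
w_output, b_output = np.zeros((h2, t)), np.zeros((1, t))

A1 = X @ w_h1 + b_h1             # (n, h1) - output of hidden layer 1
A2 = A1 @ w_h2 + b_h2            # (n, h2) - output of hidden layer 2
out = A2 @ w_output + b_output   # (n, t)  - network output
assert A1.shape == (n, h1) and A2.shape == (n, h2) and out.shape == (n, t)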
Upvotes: 1