Reputation: 3693
When benchmarking CNNs I found that most of the time is spent in the fully-connected layers. But when I calculate the computational complexity, I get:
O(conv) = N * (D * (W+P) * (H+P) * h * w) / S
O(fully_connected) = D * W * H * N
where
D = input depth (number of channels)
W, w = input width, filter width
H, h = input height, filter height
S = stride
P = padding
N = number of outputs (filters or neurons)
For example, take a 1024×11×11 input feature map (D×W×H), a 5×5 filter (h×w), no padding (P = 0), a stride S of 1, and N = 512 outputs.
This results in the following calculation for the convolution:
O(conv) = 512*(1024*11*11*5*5)/1 = 1 585 971 200
If the same input is fed to a fully-connected layer, and the desired number of outputs is still 512, then:
O(fully_connected) = 512*1024*11*11 = 63 438 848
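To make the arithmetic reproducible, here is a small Python sketch of the two formulas above (the function names are my own, purely illustrative):

```python
def conv_ops(N, D, W, H, w, h, S=1, P=0):
    # O(conv) = N * (D * (W+P) * (H+P) * h * w) / S, as defined above
    return N * (D * (W + P) * (H + P) * h * w) // S

def fc_ops(N, D, W, H):
    # O(fully_connected) = D * W * H * N, as defined above
    return D * W * H * N

print(conv_ops(N=512, D=1024, W=11, H=11, w=5, h=5))  # 1585971200
print(fc_ops(N=512, D=1024, W=11, H=11))              # 63438848
```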
Is this because convolutional layers parallelize much better on a GPU, so the conv layer performs more operations yet takes less time? Or is my way of calculating the complexity of each layer simply wrong?
Upvotes: 5
Views: 1806
Reputation: 136625
You can check whether it is only the implementation by converting the fully-connected layer into an equivalent convolution. For every fully-connected layer, there is an equivalent convolutional layer (see my question for details and examples).
Suppose the fully-connected layer with n nodes is preceded by c channels of size w × h (hence the shape c × w × h). Then the equivalent convolution is obtained by reshaping the input to (c ⋅ w ⋅ h) × 1 × 1 and applying n filters of size 1 × 1.

Now check the time. If it is faster than the fully-connected layer, then it is simply due to a better implementation of convolution.
Upvotes: 3