Reputation: 59
I have read many papers and web articles that claim that depthwise-separable convolutions reduce the memory required by a deep learning model compared to standard convolution. However, I do not understand how this would be the case, since depthwise-separable convolution requires storing an extra intermediate-step matrix as well as the final output matrix.
Here are two scenarios:
Typical convolution: You have a single 3x3x3 filter, which is applied to a 7x7 RGB input volume. This results in an output of size 5x5x1, which needs to be stored in GPU memory. Assuming float32 activations, this requires 25 × 4 = 100 bytes of memory.
Depthwise-separable convolution: You have three 3x3x1 filters, one applied to each channel of the 7x7 RGB input volume. This results in three output volumes, each of size 5x5x1. You then apply a 1x1 convolution to the concatenated 5x5x3 volume to get a final output volume of size 5x5x1. With float32 activations, this requires 300 bytes for the intermediate 5x5x3 volume and 100 bytes for the final output, for a total of 400 bytes of memory.
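To make the comparison concrete, here is a minimal PyTorch sketch of the two scenarios (the exact layer arguments, e.g. no bias, are my own choice); it reproduces the 100-byte vs. 400-byte activation counts above:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)                            # one 7x7 RGB input

# Scenario 1: standard convolution, one 3x3x3 filter
conv = nn.Conv2d(3, 1, kernel_size=3, bias=False)
out = conv(x)                                          # shape (1, 1, 5, 5)
print(out.numel() * out.element_size())                # 25 * 4 = 100 bytes

# Scenario 2: depthwise-separable convolution
depthwise = nn.Conv2d(3, 3, kernel_size=3, groups=3, bias=False)  # three 3x3x1 filters
pointwise = nn.Conv2d(3, 1, kernel_size=1, bias=False)            # 1x1 conv over the 3 channels
mid = depthwise(x)                                     # shape (1, 3, 5, 5)
out = pointwise(mid)                                   # shape (1, 1, 5, 5)
print(mid.numel() * mid.element_size())                # 75 * 4 = 300 bytes (intermediate)
print(out.numel() * out.element_size())                # 25 * 4 = 100 bytes (final)
```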
As additional evidence, when using an implementation of U-Net in PyTorch with typical nn.Conv2d convolutions, the model has 17.3M parameters and a forward/backward pass size of 320 MB. If I replace all convolutions with depthwise-separable convolutions, the model has 2M parameters and a forward/backward pass size of 500 MB. So fewer parameters, but more memory required.
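For context, the kind of depthwise-separable block I mean as a replacement for nn.Conv2d looks roughly like this (a minimal sketch; details such as padding, bias, and any normalisation/activation used in the real U-Net blocks are assumptions or omitted):

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Sketch of a depthwise-separable replacement for a standard nn.Conv2d."""
    def __init__(self, in_channels, out_channels, kernel_size=3, padding=1):
        super().__init__()
        # Depthwise step: one kxkx1 filter per input channel (groups=in_channels)
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   padding=padding, groups=in_channels)
        # Pointwise step: 1x1 convolution mixing the channels
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # The depthwise output is an extra intermediate activation that a
        # plain nn.Conv2d never materialises, which is where the additional
        # forward/backward activation memory comes from.
        return self.pointwise(self.depthwise(x))
```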
I am sure I am going wrong somewhere, as every article states that depthwise-separable convolutions require less memory. Where am I going wrong with my logic?
Upvotes: 2
Views: 1037