Fei

Reputation: 829

Tensorflow: What does tf.nn.separable_conv2d do?

I'm not quite sure what tf.nn.separable_conv2d does exactly. It seems that the pointwise_filter is the scaling factor for the different features when generating one pixel of the next layer, but I'm not sure whether my interpretation is correct. Is there any reference for this method, and what is its benefit?

tf.nn.separable_conv2d generates output of the same shape as tf.nn.conv2d, so I assumed I could replace tf.nn.conv2d with tf.nn.separable_conv2d. But the results with tf.nn.separable_conv2d are very bad: the network stops learning very early, and on the MNIST dataset the accuracy is no better than random guessing (~10%).

I thought that if I set the pointwise_filter values to all 1.0 and made it not trainable, I would get the same thing as tf.nn.conv2d. But not really... the accuracy is still ~10%.

But when tf.nn.conv2d is used with the same hyper-parameters, the accuracy can be 99%. Why?

Also, it requires channel_multiplier * in_channels < out_channels. Why? What is the role of channel_multiplier here?

Thanks.

Edit:

I previously used channel_multiplier = 1. Maybe that was a bad choice. After I changed it to 2, the accuracy became much better. But what is the role of channel_multiplier? Why is 1 not a good value?

Upvotes: 12

Views: 12160

Answers (3)

Rohola Zandie

Reputation: 775

In the regular 2D convolution performed over multiple input channels, the filter is as deep as the input and lets us freely mix channels to generate each element in the output. Depthwise convolutions don't do that - each channel is kept separate - hence the name depthwise. Here's a diagram to help explain how that works[1]:

[Diagram: depthwise separable convolution, from [1]]

If you look at the official documentation [2] you will find:

output[b, i, j, k] = sum_{di, dj, q, r}
    input[b, strides[1] * i + di, strides[2] * j + dj, q] *
    depthwise_filter[di, dj, q, r] *
    pointwise_filter[0, 0, q * channel_multiplier + r, k]

And here is sample code in TensorFlow to test it:

import tensorflow as tf
import numpy as np

width = 8
height = 8
batch_size = 100
filter_height = 3
filter_width = 3
in_channels = 3
channel_multiplier = 1
out_channels = 3

# Placeholders for the input and the two filters of the separable convolution.
input_tensor = tf.placeholder(tf.float32, shape=(batch_size, height, width, in_channels), name="input")
# Depthwise filter: one (filter_height x filter_width) kernel per input channel,
# replicated channel_multiplier times.
depthwise_filter = tf.placeholder(tf.float32, shape=(filter_height, filter_width, in_channels, channel_multiplier), name="depthwise_filter")
# Pointwise filter: a 1x1 convolution mixing the in_channels * channel_multiplier
# intermediate channels down to out_channels.
pointwise_filter = tf.placeholder(tf.float32, shape=(1, 1, channel_multiplier * in_channels, out_channels), name="pointwise_filter")

output = tf.nn.separable_conv2d(
    input_tensor,
    depthwise_filter,
    pointwise_filter,
    strides=[1, 1, 1, 1],
    padding='SAME',
)

with tf.Session() as sess:
    output_value = sess.run(output, feed_dict={
        input_tensor: np.random.rand(batch_size, height, width, in_channels),
        depthwise_filter: np.random.rand(filter_height, filter_width, in_channels, channel_multiplier),
        pointwise_filter: np.random.rand(1, 1, channel_multiplier * in_channels, out_channels)})
    print(np.shape(output_value))  # (100, 8, 8, 3): same spatial shape, out_channels deep

credit:

[1] https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/

[2] https://www.tensorflow.org/api_docs/python/tf/nn/separable_conv2d

Upvotes: 4

Vincent Vanhoucke

Reputation: 666

tf.nn.separable_conv2d() implements the so-called 'separable convolution' described from slide 26 onwards of this talk.

The idea is that instead of convolving jointly across all channels of an image, you run a separate 2D convolution on each channel with a depth of channel_multiplier. The in_channels * channel_multiplier intermediate channels get concatenated together, and mapped to out_channels using a 1x1 convolution.
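
To make that concrete, here is a minimal sketch (the shapes are made up for illustration) showing that tf.nn.separable_conv2d computes the same thing as a tf.nn.depthwise_conv2d followed by a 1x1 tf.nn.conv2d:

import tensorflow as tf
import numpy as np

x = tf.placeholder(tf.float32, [1, 8, 8, 3])    # NHWC input
dw = tf.placeholder(tf.float32, [3, 3, 3, 2])   # channel_multiplier = 2
pw = tf.placeholder(tf.float32, [1, 1, 6, 16])  # 3 * 2 intermediate channels -> 16 out

# The fused op.
fused = tf.nn.separable_conv2d(x, dw, pw, strides=[1, 1, 1, 1], padding='SAME')

# The same computation in two explicit steps.
depthwise = tf.nn.depthwise_conv2d(x, dw, strides=[1, 1, 1, 1], padding='SAME')
two_step = tf.nn.conv2d(depthwise, pw, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    feeds = {x: np.random.rand(1, 8, 8, 3),
             dw: np.random.rand(3, 3, 3, 2),
             pw: np.random.rand(1, 1, 6, 16)}
    fused_val, two_step_val = sess.run([fused, two_step], feeds)
    print(np.allclose(fused_val, two_step_val, atol=1e-5))  # True: the two paths agree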

It's often an effective way to reduce the parametric complexity of the early convolutions in a convnet, and can materially speed up training. channel_multiplier controls that complexity, and would typically be 4 to 8 for an RGB input. For a grayscale input, using it makes little sense.
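
As a rough illustration of that parameter saving (the numbers here are hypothetical: 3x3 kernels, an RGB input, 64 output channels, channel_multiplier = 4):

in_ch, out_ch, k, cm = 3, 64, 3, 4

regular = k * k * in_ch * out_ch           # 3*3*3*64  = 1728 weights
depthwise = k * k * in_ch * cm             # 3*3*3*4   = 108 weights
pointwise = 1 * 1 * in_ch * cm * out_ch    # 1*1*12*64 = 768 weights

print(regular, depthwise + pointwise)      # 1728 vs 876: roughly half the parameters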

Upvotes: 29

iga

Reputation: 3643

To answer the last part of the question:

Also, it requires channel_multiplier * in_channels < out_channels. Why?

I don't know why this constraint was put in originally, but it has been removed in the current master branch of TF and should make it into version 1.3. The thinking was probably something along the lines of "If you are reducing the number of channels in the pointwise step, you might as well have picked a smaller channel multiplier and saved on computation". I guess this reasoning is flawed because the pointwise step can combine values from different depthwise filters, or because one might want to reduce the dimension a bit, not by a full factor.
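
For example, with the constraint removed (assuming TF 1.3 or later), a configuration like the following, where the pointwise step projects 16 intermediate channels down to only 4 output channels, should be accepted (the shapes here are made up for illustration):

import tensorflow as tf
import numpy as np

# in_channels = 8, channel_multiplier = 2 -> 16 intermediate channels,
# projected down to out_channels = 4 by the pointwise step.
x = tf.placeholder(tf.float32, [1, 8, 8, 8])
dw = tf.placeholder(tf.float32, [3, 3, 8, 2])
pw = tf.placeholder(tf.float32, [1, 1, 16, 4])

y = tf.nn.separable_conv2d(x, dw, pw, strides=[1, 1, 1, 1], padding='SAME')

with tf.Session() as sess:
    out = sess.run(y, {x: np.random.rand(1, 8, 8, 8),
                       dw: np.random.rand(3, 3, 8, 2),
                       pw: np.random.rand(1, 1, 16, 4)})
    print(out.shape)  # (1, 8, 8, 4)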

Upvotes: 0
