Yong Wang

Reputation: 69

Why does global average pooling work in ResNet?

Recently, I started a classification project using a very shallow ResNet. The model has just 10 conv layers, followed by a global average pooling layer and then the softmax layer.
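
For reference, the classification head looks roughly like this (a minimal sketch; the class count is a placeholder and the conv backbone is omitted):

    import tensorflow as tf

    NUM_CLASSES = 10  # placeholder; the real class count is different

    # feature map from the last conv layer: [-1, 128, 1, 32] (TensorFlow NHWC form)
    features = tf.keras.layers.Input(shape=(128, 1, 32))
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)                # -> [-1, 32]
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(pooled)
    head = tf.keras.Model(features, outputs)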

The performance is as good as I expected --- 93% (yeah, it is OK).

However, for certain reasons, I need to replace the global average pooling layer.

I have tried the following ways:

(The input shape of this layer is [-1, 128, 1, 32], in TensorFlow format.)

  1. A global max pooling layer, but only got 85% accuracy (roughly sketched after this list).

  2. An exponential moving average over axis 1, but got 12% (it almost didn't work):

     # split axis 1 into 128 slices of shape [-1, 1, 1, 32]
     split_list = tf.split(input, 128, axis=1)
     avg_pool = split_list[0]
     beta = 0.5
     # exponentially weighted moving average over the 128 slices
     for i in range(1, 128):
         avg_pool = beta * split_list[i] + (1 - beta) * avg_pool
     avg_pool = tf.reshape(avg_pool, [-1, 32])
    
  3. Splitting the input into 4 parts along axis 1, pooling each part, and finally concatenating them, but got 75%:

     # split axis 1 into 4 chunks of shape [-1, 32, 1, 32]
     split_shape = [32, 32, 32, 32]
     split_list = tf.split(input, split_shape, axis=1)
     # pool each chunk down to [-1, 32], then concatenate to [-1, 128]
     for i in range(len(split_shape)):
         split_list[i] = tf.keras.layers.GlobalMaxPooling2D()(split_list[i])
     avg_pool = tf.concat(split_list, axis=1)
    
  4. Averaging over the last axis, [-1, 128, 1, 32] --> [-1, 128], but it didn't work (also sketched after this list).

  5. A conv layer with a single filter, so that the output shape is [-1, 128, 1, 1], but it didn't work either, around 25%.
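
For attempts 1 and 4, which have no code above, what I did looks roughly like this (the dummy batch size of 4 is just for illustration):

    import tensorflow as tf

    # dummy batch with the same shape as the real feature maps: [-1, 128, 1, 32]
    x = tf.random.normal([4, 128, 1, 32])

    # attempt 1: global max pooling instead of global average pooling
    max_pool = tf.keras.layers.GlobalMaxPooling2D()(x)   # -> [4, 32]

    # attempt 4: average over the last axis only
    chan_avg = tf.reduce_mean(x, axis=-1)                # -> [4, 128, 1]
    chan_avg = tf.reshape(chan_avg, [-1, 128])           # -> [4, 128]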

I am pretty confused about why global average pooling works so well. And is there any other way to replace it?

Upvotes: 4

Views: 7258

Answers (1)

user8879803


Global Average Pooling has the following advantages over the fully connected final layers paradigm:

  1. The removal of a large number of trainable parameters from the model. Fully connected (dense) layers have lots of parameters: flattening a 7 x 7 x 64 CNN output and feeding it into a 500-node dense layer yields 7 x 7 x 64 x 500 ≈ 1.57 million weights that need to be trained. Removing these layers speeds up the training of your model (see the parameter-count sketch after this list).
  2. Eliminating all these trainable parameters also reduces the tendency to over-fit, which otherwise has to be managed in fully connected layers by the use of dropout.
  3. The authors argue in the original paper that removing the fully connected classification layers forces the feature maps to be more closely related to the classification categories – so that each feature map becomes a kind of “category confidence map”.
  4. Finally, the authors also argue that the averaging operation over each feature map makes the model more robust to spatial translations in the data. In other words, as long as the requisite feature is activated somewhere in the feature map, it will still be "picked up" by the averaging operation (a tiny demonstration follows the sketch below).
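
As a rough illustration of point 1, you can compare the two kinds of head directly in Keras (assuming the 7 x 7 x 64 feature map and 500-node dense layer mentioned above):

    import tensorflow as tf

    # flattening a 7 x 7 x 64 output into a 500-node dense layer:
    # 7*7*64*500 weights + 500 biases = 1,568,500 trainable parameters
    dense_head = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(7, 7, 64)),
        tf.keras.layers.Dense(500),
    ])
    print(dense_head.count_params())  # 1568500

    # global average pooling over the same output adds no trainable parameters at all
    gap_head = tf.keras.Sequential([
        tf.keras.layers.GlobalAveragePooling2D(input_shape=(7, 7, 64)),
    ])
    print(gap_head.count_params())  # 0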

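And a tiny demonstration of point 4 (the shapes and values are made up purely for illustration): the same feature activated at two different spatial positions gives exactly the same pooled value, whereas a flattened vector puts the activation at different indices, which a dense layer would treat as completely different inputs.

    import tensorflow as tf

    # one 6 x 6 single-channel feature map with an activation at (1, 1),
    # and another with the same activation shifted to (4, 4)
    a = tf.scatter_nd([[0, 1, 1, 0]], [1.0], shape=[1, 6, 6, 1])
    b = tf.scatter_nd([[0, 4, 4, 0]], [1.0], shape=[1, 6, 6, 1])

    gap = tf.keras.layers.GlobalAveragePooling2D()
    print(gap(a).numpy(), gap(b).numpy())   # identical outputs: the shift is averaged away

    # flattened, the activation lands at index 7 vs index 28
    print(tf.argmax(tf.reshape(a, [1, -1]), axis=1).numpy(),
          tf.argmax(tf.reshape(b, [1, -1]), axis=1).numpy())
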
Upvotes: 6
