user785099

Reputation: 5563

Translational equivariance and its relationship with convolutional layers and spatial pooling layers

In the context of convolutional neural network model, I once heard a statement that:

One desirable property of convolutions is that they are translationally equivariant, and the introduction of spatial pooling can corrupt this property of translational equivariance.

What does this statement mean, and why?

Upvotes: 2

Views: 1374

Answers (2)

prosti

Reputation: 46449

A note on the terms translation equivariance and translation invariance: they are different.

Translation equivariance means that a translation of the input features results in an equivalent translation of the outputs. This is desirable when we need to find where a pattern is located (e.g. its bounding rectangle).

Translation invariance means that a translation of the input does not change the output at all.

Translation invariance is important to achieve. It effectively means that after learning a certain pattern in the lower-left corner of a picture, our convnet can recognize the pattern anywhere, including in the upper-right corner.
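To make the difference between the two terms concrete, here is a minimal NumPy sketch (the 1-d pattern and kernel are made up for illustration): the convolution response shifts together with the input pattern (equivariance), while a global max over that response stays the same wherever the pattern sits (invariance).

    import numpy as np

    # Made-up 1-d "image" containing a small pattern, and a made-up kernel.
    x = np.array([0, 0, 1, 2, 1, 0, 0, 0, 0, 0], dtype=float)
    x_shifted = np.roll(x, 3)            # same pattern, moved 3 steps to the right
    k = np.array([1, 2, 1], dtype=float)

    # CNN-style "convolution" is cross-correlation, hence np.correlate.
    feat = np.correlate(x, k, mode='valid')
    feat_shifted = np.correlate(x_shifted, k, mode='valid')

    # Equivariance: the response peak moves together with the input pattern.
    print(feat.argmax(), feat_shifted.argmax())   # 2 and 5 -- positions differ by the shift

    # Invariance: a global max-pool discards the position, so the output
    # is identical no matter where the pattern sits.
    print(feat.max(), feat_shifted.max())         # 6.0 and 6.0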

A densely connected network alone, without convolution layers in between, cannot achieve translation invariance.

We introduce convolution layers to bring this generalization power to deep networks and to learn representations from fewer training samples.

Upvotes: 0

Salvador Dali

Reputation: 222811

Most probably you heard it from Bengio's book. I will try to give you my explanation.


In a rough sense, two transformations are equivariant if f(g(x)) = g(f(x)). In your case of convolutions and translations, this means that convolve(translate(x)) gives the same result as translate(convolve(x)). This is desirable because if your convolution finds the eye of a cat in an image, it will still find that eye after you shift the image.

You can see this for yourself (I use a 1-d convolution only because it is easy to calculate). Let's convolve v = [4, 1, 3, 2, 3, 2, 9, 1] with k = [5, 1, 2]. The result is [27, 12, 23, 17, 35, 21].

Now let's shift v by prepending an element: v' = [8] + v. Convolving with k you get [46, 27, 12, 23, 17, 35, 21]. As you can see, the result is just the previous result with one new value prepended.
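If you want to reproduce these numbers, here is a small NumPy sketch. Note that "convolution" as used in CNNs does not flip the kernel, so np.correlate is the matching NumPy call.

    import numpy as np

    v = np.array([4, 1, 3, 2, 3, 2, 9, 1])
    k = np.array([5, 1, 2])

    # CNN-style "convolution" is cross-correlation (no kernel flip).
    print(np.correlate(v, k, mode='valid'))          # [27 12 23 17 35 21]

    # Shift the input by prepending a new element.
    v_shifted = np.concatenate(([8], v))
    print(np.correlate(v_shifted, k, mode='valid'))  # [46 27 12 23 17 35 21]
    # The old outputs reappear unchanged, only shifted by one position:
    # the convolution is translationally equivariant.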


Now the part about spatial pooling. Let's do a max-pooling of size 3 on the first result and on the second one. In the first case you get [27, 35], in the second [46, 35, 21]. As you see, 27 has disappeared (the result was corrupted), and it would be corrupted even more with average pooling.
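The same pooling step in NumPy, using a small helper (written here just for the demo) that pools non-overlapping windows and keeps the trailing partial window, as in the hand computation above:

    import numpy as np

    def max_pool_1d(x, size=3):
        # Non-overlapping max-pooling; a trailing partial window is kept.
        return np.array([x[i:i + size].max() for i in range(0, len(x), size)])

    print(max_pool_1d(np.array([27, 12, 23, 17, 35, 21])))      # [27 35]
    print(max_pool_1d(np.array([46, 27, 12, 23, 17, 35, 21])))  # [46 35 21]
    # After the shift, 27 no longer survives the pooling: the pooled outputs
    # are not just shifted copies of each other, so equivariance is corrupted.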

P.S. Max/min pooling is the most translationally invariant of all poolings (if you can say so, judging by the number of non-corrupted elements).

Upvotes: 4
