user3180

Reputation: 1487

Convolutional neural networks

Is this intuitive understanding of convolutional neural networks correct?

1. A convolution basically measures how similar a local part of an image is to a convolutional kernel/filter.
2. The kernel/filter is like a feature detector. Importantly, it is learned and automatically adjusted and optimized through SGD.

Upvotes: 1

Views: 413

Answers (2)

T3am5hark

Reputation: 866

The explanation you present is roughly correct, but can benefit from some elaboration.

One way to think of convolutional neural networks is as location-independent pattern recognizers stacked in a hierarchical fashion. The convolution operation effects location-independence by applying the kernel at every location in the input space.

Each convolutional layer will identify specific features (which are learned during training). Its output can be thought of as a map of which features are present and where they occur.
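A minimal NumPy sketch of that idea (the toy image and "edge" kernel below are made up for illustration; in a real network the kernel values would be learned):

import numpy as np

def conv2d_valid(image, kernel):
    # Apply one kernel at every location ("valid" padding, stride 1).
    # This is cross-correlation, as is conventional in CNN libraries.
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # dot product between the kernel and the local image patch
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# toy vertical-edge detector applied to a toy image
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=float)
kernel = np.array([[-1, 1],
                   [-1, 1]], dtype=float)
print(conv2d_valid(image, kernel))  # feature map: large values where the "edge" pattern appears

The same kernel is reused at every spatial position, which is what gives the location-independence described above.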

Stacking convolutional layers allows subsequent layers to identify more complex features (you can think of each convolutional layer in the architecture as identifying features that are themselves compositions of features learned in previous layers).

In order to train these networks efficiently, we typically want to "funnel" down the data dimension as we move toward the output classifier (left to right in the network). This typically means sacrificing some of the granularity of the spatial information through sub-sampling, either via pooling operations (typically max, sometimes average) or via convolutional striding, which evaluates the convolutional output only at a decimated subset of the possible output locations.
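As a rough illustration of that sub-sampling step, here is a 2x2 max-pooling sketch in NumPy (the input values and sizes are arbitrary, chosen only to show the halving of resolution):

import numpy as np

def max_pool_2x2(feature_map):
    # 2x2 max pooling with stride 2: keep the strongest response in each block
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    return feature_map[:h2 * 2, :w2 * 2].reshape(h2, 2, w2, 2).max(axis=(1, 3))

fm = np.array([[1, 3, 0, 2],
               [4, 2, 1, 0],
               [0, 1, 5, 6],
               [2, 0, 7, 1]], dtype=float)
print(max_pool_2x2(fm))  # [[4. 2.]
                         #  [2. 7.]]  -- half the resolution, strongest responses kept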

In concert, the convolutional and pooling operations learn a nonlinear projection of the input space. In mathematical terms, the "deep" convolutional part of the network has learned a nonlinear mapping from a very high-dimensional input space (RGB pixels, for instance, in the case of images) to a lower-dimensional output that essentially conveys the incidence and locations (typically at a small fraction of the original spatial or temporal resolution) of a set of learned features.

Once we have such a low-dimensional representation, it is typically flattened and fed into a "traditional" fully-connected network that can operate efficiently for classification or prediction tasks on the (comparatively) low-dimensional abstract feature set produced by the convolutional layers.
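A hand-wavy NumPy sketch of that final stage (the feature-map shape, class count, and random weights here are placeholders; in practice these weights are learned jointly with the convolutional kernels):

import numpy as np

rng = np.random.default_rng(0)

# pretend output of the convolutional/pooling stack:
# 8 feature maps of size 4x4 for a single input image
features = rng.normal(size=(8, 4, 4))

# flatten to a single vector of length 8*4*4 = 128
x = features.reshape(-1)

# one fully-connected layer followed by a softmax over, say, 10 classes
W = rng.normal(size=(10, 128))
b = np.zeros(10)
logits = W @ x + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)  # class probabilities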

Historically, many classification approaches relied on complicated "feature engineering" to perform similar operations to what the convolutional layers are learning in order to tractably and effectively train classifiers (which could be neural nets, random forests, SVMs, or any number of other algorithms). The power of ConvNets is their ability to eliminate the need for feature engineering and to fully integrate the task of feature learning with the training of the classifier.

Upvotes: 0

lejlot

Reputation: 66805

This is true only with a veeeeeeeeeery rough understanding of "how similar". If you consider computation of a dot product as measuring similarity, then the answer is yes. Why do I, personally, have doubts? Because the result heavily depends on the norm of the vector (or matrix). Let's consider the image

1 1 1
2 2 2
1 1 1

and kernel

1 1 1
2 2 2
1 1 1

We convolve them (multiply element-wise and sum) and get

1*1 + 1*1 + 1*1 + 2*2 + 2*2 + 2*2 + 1*1 + 1*1 + 1*1 = 18

now lets take image

2 2 2
2 2 2
2 2 2

and we get

2*1 + 2*1 + 2*1 + 2*2 + 2*2 + 2*2 + 2*1 + 2*1 + 2*1 = 24

I would say the first image is more similar to the kernel than the second one, yet the convolution says otherwise. So it is not that simple: a convolution is just basic, linear filtering of the image, sliding over the signal and applying a dot product to local patches. Calling it "a similarity search" is a bit too much. It is, however, a feature detector, albeit a very specific one.
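For what it is worth, both numbers above are easy to reproduce, and the norm dependence disappears if you normalize the dot product (cosine similarity); this is just a check on the argument, not something the convolution itself does:

import numpy as np

kernel = np.array([[1, 1, 1],
                   [2, 2, 2],
                   [1, 1, 1]], dtype=float)
image1 = kernel.copy()          # identical to the kernel
image2 = np.full((3, 3), 2.0)   # all twos

for img in (image1, image2):
    dot = np.sum(img * kernel)  # raw "convolution" response: 18 and 24
    cos = dot / (np.linalg.norm(img) * np.linalg.norm(kernel))
    print(dot, cos)             # cosine ranks image1 (1.0) above image2 (~0.94)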

The crucial thing about convolutions, which is missing from your description, is the shared nature of these detectors: you learn a bunch of local image filters which are applied to every single "spot" of the image, thus achieving a kind of location invariance and a considerable reduction in the parametrization of your model.
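A back-of-the-envelope illustration of that parameter saving (the 32x32 input size and the single 3x3 filter are assumptions made up for this example):

# map a 32x32 single-channel image to a 32x32 output
fully_connected_params = (32 * 32) * (32 * 32)  # every output pixel sees every input pixel
shared_conv_params = 3 * 3                      # one 3x3 kernel reused at every location

print(fully_connected_params)  # 1048576
print(shared_conv_params)      # 9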

Upvotes: 4
