esafresa

Reputation: 600

4d Input Tensor vs 1d Input Tensor (aka vector) to a neural network

Reading about machine learning, I keep seeing references to the "input vector" or "feature vector", a 1d tensor that holds the input to the neural network. So, for example, a 28x28 grayscale image would be a 784-dimensional vector.

Then I also keep seeing references to images being represented as a 4-dimensional tensor, with the dimensions being the number in the batch, color channel, height, and width. For example, this is how it's described in "Deep Learning with Python" by Francois Chollet.

I'm wondering, why is it described in these different ways? When would one be used vs the other?

Upvotes: 1

Views: 1549

Answers (1)

Jatentaki

Reputation: 13103

There are two main considerations.

The first is batching. Since we usually want to perform each optimization step based on gradients computed over a number of training examples (and not just one), it is helpful to run the calculations for all of them at once. The standard approach in many libraries is therefore that the first dimension is the batch dimension, and all operations are applied independently to each subtensor along that first dimension. Most tensors in actual code are thus at least 2-dimensional: [batch, any_other_dimensions...]. However, from the perspective of the neural network, batching is an implementation detail, so it is often skipped for clarity. You mention 784-dimensional vectors, which are in practice almost undoubtedly processed in batches, so with a batch size of 16 the tensors would be of size [batch, features] = [16, 784]. Summing up, we have the first dimension explained as batch, and then there are the any_other_dimensions..., which in the above example happen to be a single features dimension of size 784.
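To make the shapes concrete, here is a minimal sketch in PyTorch (whose conventions match the layouts above; the batch size of 16 and the layer width of 128 are just illustrative choices):

    import torch
    import torch.nn as nn

    # A batch of 16 flattened 28x28 grayscale images: [batch, features] = [16, 784].
    batch = torch.randn(16, 784)

    # A fully connected layer from 784 features to 128; it is applied
    # independently to each of the 16 rows along the batch dimension.
    fc = nn.Linear(784, 128)
    out = fc(batch)
    print(out.shape)  # torch.Size([16, 128])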

Then come the 4-dimensional tensors, which arise when using convolutional neural networks instead of fully connected ones. A fully connected network uses full matrices, which means that every neuron of the previous layer contributes to every neuron of the following layer. Convolutional neural networks can be seen as using a specially structured sparse matrix, where each neuron of the previous layer influences only some neurons of the following layer, namely those within some fixed distance of its location. Therefore, convolutions impose a spatial structure, which needs to be reflected in the intermediate tensors. Instead of [batch, features], we therefore need [batch, x, y] to reflect the spatial structure of the data. Finally, convolutional neural networks, in everyday practice, have a bit of an admixture of fully connected ones: they have the notion of multiple "features" which are localized spatially, giving rise to the so-called "feature maps", and the tensor grows to 4d: [batch, feature, x, y]. Each value tensor_new[b, f, x, y] is calculated based on previous values tensor_previous[b', f', x', y'], subject to the following constraints (a code sketch follows the list):

  1. b = b': we do not mix the batch elements
  2. x' is at most some distance away from x and similarly for y': we only use the values in the spatial neighborhood
  3. All f's are used: this is the "fully connected" part.
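Here is a minimal PyTorch sketch of those constraints in action (the channel counts and the 3x3 kernel are arbitrary, purely for illustration):

    import torch
    import torch.nn as nn

    # A batch of 16 RGB images, 28x28 pixels: [batch, feature, x, y].
    images = torch.randn(16, 3, 28, 28)

    # Each output value mixes all 3 input features (constraint 3) within a
    # 3x3 spatial neighborhood (constraint 2) and never across the batch
    # (constraint 1). padding=1 keeps the spatial size unchanged.
    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
    out = conv(images)
    print(out.shape)  # torch.Size([16, 8, 28, 28])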

Convolutional neural networks are better suited to visual tasks than fully connected ones, which become infeasible for large enough images (imagine storing a fully connected matrix of size (1024 * 1024) ^ 2 for a 1024 x 1024px image). While 4d tensors in CNNs are specific to 2d vision, you can encounter 3d tensors in 1d signal processing (for example sound): [batch, feature, time]; 5d tensors in 3d volume processing: [batch, feature, x, y, z]; and entirely different layouts in other kinds of networks which are neither fully connected nor convolutional.
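Again as an illustrative PyTorch sketch (the signal length, volume size, and channel counts here are made up):

    import torch
    import torch.nn as nn

    # 1d signal processing (e.g. sound): [batch, feature, time].
    sound = torch.randn(16, 1, 8000)
    print(nn.Conv1d(1, 4, kernel_size=5)(sound).shape)
    # torch.Size([16, 4, 7996])

    # 3d volume processing: [batch, feature, x, y, z].
    volume = torch.randn(16, 1, 32, 32, 32)
    print(nn.Conv3d(1, 4, kernel_size=3)(volume).shape)
    # torch.Size([16, 4, 30, 30, 30])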

Summing up: if somebody tells you they are using 1d vectors, that's a simplification: almost surely they use at least two dimensions, for batching. Then, in the context of 2d computer vision, convolutional networks are the standard, and they come with 4d tensors. In other scenarios you may see different layouts and dimensionalities. Keywords to google for more reading: fully connected neural networks, convolutional neural networks, minibatching, stochastic gradient descent (the last two are closely related).

Upvotes: 2
