onemanarmy

Reputation: 51

Implementation of a 3D convolutional neural network

I'm trying to implement this approach for object detection and tracking, but I can't wrap my mind around the details. I looked for reviews and explanations of this article without success. What I don't understand is this:

For temporal information, we take all the 3D points from the past 5 timestamps. Thus our input is a 4 dimensional tensor consisting of time, height, X and Y. For both our early-fusion and late-fusion models, we train from scratch using Adam optimizer with a learning rate of 1e-4. The model is trained on a 4 Titan XP GPU server with batch size of 12

I know that the input to a CNN has the following shape:

[batch_size, channels, X, Y]

but here they are considering

[time, channels, X, Y]

and then they mention that the batch size is 12! What I don't understand is where the batch_size fits in and how it relates to the 5 timestamps.

I hope someone can provide insights.

Since their dataset is not open source, I'm working with the KITTI tracking benchmark.

Upvotes: 0

Views: 527

Answers (1)

Berriel

Reputation: 13651

If you look at tf.nn.conv3d, the input shape is:

Shape [batch, in_depth, in_height, in_width, in_channels]

You can see where the batch dimension goes and you can treat in_depth as you wish. For temporal tasks, you can say that this represents some time steps.
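
As a rough illustration, here is a minimal sketch of calling tf.nn.conv3d with in_depth used as the time axis. The grid size (16 x 16), the number of height bins (8), the kernel size and the number of output channels are assumptions I picked for the example; they are not values from the paper.

```python
import tensorflow as tf

# Hypothetical sizes, chosen only to keep the example small.
batch, time_steps, x_size, y_size, height_bins = 12, 5, 16, 16, 8

# Channels-last input: [batch, in_depth, in_height, in_width, in_channels].
# Here in_depth is treated as time and in_channels as height.
inputs = tf.random.normal([batch, time_steps, x_size, y_size, height_bins])

# Filter layout: [filter_depth, filter_height, filter_width, in_channels, out_channels].
filters = tf.random.normal([3, 3, 3, height_bins, 4])

# Stride 1 with SAME padding keeps the time, X and Y sizes unchanged.
outputs = tf.nn.conv3d(inputs, filters, strides=[1, 1, 1, 1, 1], padding="SAME")
print(outputs.shape)  # (12, 5, 16, 16, 4)
```

The batch axis passes through untouched; the 3D kernel slides over time, X and Y only.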


Now, for their case specifically: they have a point cloud. Each point (or voxel) sits at an (X, Y) position and also has a height. They state explicitly:

"[...] and treat the height dimension as the channel dimension"

So, if we use channels-last notation (the default in the TensorFlow docs), we have [X, Y, height] (i.e., 3D points). Then, they say:

"[...] For temporal information, we take all the 3D points from the past 5 timestamps"

That means we need a temporal dimension, e.g., [time, X, Y, height], which is exactly what they said (except they used channels-first notation). With this 4D tensor, we can use 3D convolutions. However, we usually need them to operate on batches of samples rather than single samples. Hence the batch dimension: [batch, time, X, Y, height]. In their case, specifically, they train with [12, 5, X, Y, height], where batch=12 and time=5. In other words, the batch size does not replace the time dimension: each of the 12 samples in a batch is its own stack of 5 consecutive timestamps.
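
To make the layout concrete, here is a small NumPy sketch of how such a batch could be assembled. The sizes and the make_sample helper are hypothetical, and random values stand in for real voxelized frames (e.g. from KITTI).

```python
import numpy as np

# Hypothetical sizes; only TIME_STEPS=5 and BATCH_SIZE=12 come from the paper.
TIME_STEPS, BATCH_SIZE = 5, 12
X_SIZE, Y_SIZE, HEIGHT_BINS = 16, 16, 8


def make_sample(rng):
    """Stack 5 consecutive frames, each [X, Y, height], along a new time axis."""
    frames = [rng.random((X_SIZE, Y_SIZE, HEIGHT_BINS)) for _ in range(TIME_STEPS)]
    return np.stack(frames, axis=0)  # [time, X, Y, height]


rng = np.random.default_rng(0)

# One sample is a 4D tensor: [time, X, Y, height].
sample = make_sample(rng)

# A batch is 12 independent samples: [batch, time, X, Y, height].
batch = np.stack([make_sample(rng) for _ in range(BATCH_SIZE)], axis=0)

# The same data with height as a leading channel axis, [batch, height, time, X, Y],
# which is the layout expected by frameworks that use channels-first 3D convolutions.
batch_channels_first = np.transpose(batch, (0, 4, 1, 2, 3))

print(sample.shape, batch.shape, batch_channels_first.shape)
```

In a real pipeline each call to make_sample would read a different 5-frame window from the sequence rather than random data.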

Upvotes: 1
