Reputation: 51
I'm trying to implement this approach for object detection and tracking. And I can't wrap my mind around the details. I tried to look for reviews and explaination for this article. What I don't undersand is this:
For temporal information, we take all the 3D points from the past 5 timestamps. Thus our input is a 4 dimensional tensor consisting of time, height, X and Y. For both our early-fusion and late-fusion models, we train from scratch using Adam optimizer with a learning rate of 1e-4. The model is trained on a 4 Titan XP GPU server with batch size of 12
I know that a CNN input is the following
[batch_size, channels, X, Y]
but here they are considering
[time, channels, X, Y]
and then they mention the batch size is 12! What i dont understand is where are they considering the batch_size and what does it represent for the 5 timestamps.
I hope someone can provide insights.
Since their dataset is not open source, I'm working on KITTI tracking benchmark.
Upvotes: 0
Views: 527
Reputation: 13651
If you consider tf.nn.conv3d
, the Input shape is:
Shape [batch, in_depth, in_height, in_width, in_channels]
You can see where the batch dimension goes and you can treat in_depth
as you wish. For temporal tasks, you can say that this represents some time steps.
Okay, specifically in their case. They have a point cloud. Each point (or voxel) is in an (X, Y)
position. This data point also has height
. They are very specific on saying:
"[...] and treat the height dimension as the channel dimension"
So, if we use channels-last notation (as the default TensorFlow docs), we have [X, Y, height]
(i.e., 3D points). Then, they say:
"[...] For temporal information, we take all the 3D points from the past 5 timestamps"
That means we need a temporal dimension, e.g., [time, X, Y, height]
, which is exactly what they said (except they used channels-first notation). With this 4D tensor, we can use 3D convolutions. However, we usually need them to operate on batches of samples rather than single samples. Hence the batch dimension: [batch, time, X, Y, height]
. In their case, specifically, they train with [12, 5, X, Y, height]
, where batch=12
and time=5
.
Upvotes: 1