Reputation: 97
I wish to extract feature vectors for regions of interest in object detection. I am using Faster R-CNN with Inception-v2, essentially following this tutorial, but I've added detection_features as a key.
I was under the impression that the feature vector was the output of the CNN before it goes on to be classified. From looking at Table 1 in the Inception-v2 paper, I expected this to be of size 1x1x1000. However, the size of output_dict['detection_features'][0] in my code is 4x4x1024, which confuses me, as that size does not appear at any step of Inception-v2.
Any pointers as to why the sizes do not match would be greatly appreciated. I'm concerned I may have misunderstood something, but I can't find much documentation on the feature vector in TensorFlow's object detection API.
Many thanks
Upvotes: 0
Views: 247
Reputation: 1199
The specific number of units per layer isn't an architectural law; a network following the Inception V2 architecture is foremost a matter of the flow of information. Your situation looks fine. The creator of Keras once wrote that using unit counts in multiples of 8 can give a slight computational advantage, so your final layer's 1024 units is, if anything, slightly better than the paper's 1000. As for the 4x4 part, that is a result of the input dimensions. This is why there is a minimum possible input size (otherwise some operations would have no pixels left to work with). A larger input image run through the same Inception V2 operations produces a larger output feature map. That's fine; it just means a straight flattening (between the CNN and the classifier) yields more units, or, alternatively, that global pooling discards more information.
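If it helps to see this directly, here is a minimal Keras sketch (a toy stack of (3,3) convolutions and (2,2) pooling, not Inception V2 or the Faster R-CNN feature extractor) showing how the same layers give a larger feature map for a larger input, and what that means for flattening versus global pooling:

```python
import tensorflow as tf

def feature_map_shape(input_size):
    # Toy stack of (3,3) convolutions and (2,2) pooling; the layer definitions
    # stay the same, only the input size changes.
    inp = tf.keras.Input(shape=(input_size, input_size, 3))
    x = tf.keras.layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inp)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    x = tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
    x = tf.keras.layers.MaxPooling2D((2, 2))(x)
    return tf.keras.Model(inp, x).output_shape

print(feature_map_shape(64))   # (None, 16, 16, 64) -> flattening gives 16*16*64 units
print(feature_map_shape(128))  # (None, 32, 32, 64) -> flattening gives 32*32*64 units
# Global average pooling would give 64 units in both cases, discarding the extra
# spatial information that the larger input produced.
```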
In summary: what you've done is perfectly fine. An architecture is a matter of graph operations, whereas the specific dimensions within the network are a matter of the particular application.
Edit: a more thorough explanation
Convolutional layers are defined by their kernel shape and their number of units (the number of kernels). If an architecture uses a convolutional layer with a (3,3) kernel, it will apply that kernel regardless of the size of the content provided to it (as long as the input is at least as large as the kernel). So if a network architecture like VGG (diagram) calls for a certain number of convolutional layers with (3,3) kernels followed by a (2,2) pooling layer, it really doesn't matter whether you make your network's input shape (299,299,3) or (32,32,3). The same operations will be performed on the inputs, just at a different number of positions along the spatial axes, resulting in a different output shape (the last axis of that output is the number of units, i.e. the number of unique kernels).
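As a concrete illustration (a minimal plain-Keras sketch standing in for the VGG-style block above, not reproducing it), the very same layer objects, i.e. the same kernels and the same parameter count, can be applied to a (299,299,3) input and a (32,32,3) input; only the output's spatial dimensions differ:

```python
import tensorflow as tf

# One shared set of kernels: two (3,3) convolutions followed by (2,2) pooling,
# loosely in the spirit of a single VGG block.
conv1 = tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu')
conv2 = tf.keras.layers.Conv2D(64, (3, 3), padding='same', activation='relu')
pool = tf.keras.layers.MaxPooling2D((2, 2))

def apply_block(x):
    return pool(conv2(conv1(x)))

big = tf.random.normal((1, 299, 299, 3))
small = tf.random.normal((1, 32, 32, 3))

print(apply_block(big).shape)    # (1, 149, 149, 64)
print(apply_block(small).shape)  # (1, 16, 16, 64)
# The kernels (and therefore the parameter count) are identical in both calls;
# only the number of positions they are applied at differs, so only the
# spatial dimensions of the output change.
```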
Upvotes: 1