Why does the Gramian Matrix work for VGG16 but not for EfficientNet or MobileNet?

Question

A Neural Algorithm of Artistic Style uses the Gramian Matrix of the intermediate feature vectors of the VGG16 classification network trained on ImageNet. Back then, that was probably a good choice because VGG16 was one of the best-performing classification networks. Nowadays, there are much more efficient classification networks that surpass VGG in classification performance while requiring fewer parameters and FLOPs, for example EfficientNet and MobileNetV2.

When I tried this out in practice, the Gramian Matrix of VGG16 features appears representative of the image style: the L2 distance between the Gramian Matrices of stylistically similar images is smaller than the L2 distance between those of stylistically unrelated images. For the Gramian Matrix calculated from EfficientNet and MobileNetV2 features, that does not appear to be the case: the L2 distance between very similar images and the L2 distance between very dissimilar images differ by only about 5%.
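For reference, the comparison described above can be sketched roughly as follows. This is a minimal NumPy sketch, not the exact code used in the experiment: it assumes the features of one layer are already extracted as an array of shape (channels, height, width), and the function names `gram_matrix`, `style_distance`, and `relative_gap`, as well as the normalization by the number of spatial positions, are illustrative choices.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of one layer's feature map, shape (channels, height, width).

    Assumption: the result is divided by the number of spatial positions
    so that feature maps of different resolutions stay comparable.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)   # one row per channel
    return flat @ flat.T / (h * w)      # (channels, channels) channel correlations


def style_distance(features_a, features_b):
    """L2 (Frobenius) distance between the Gram matrices of two feature maps."""
    diff = gram_matrix(features_a) - gram_matrix(features_b)
    return float(np.linalg.norm(diff))


def relative_gap(d_similar, d_dissimilar):
    """Fraction by which the similar-pair distance is below the dissimilar-pair distance."""
    return (d_dissimilar - d_similar) / d_dissimilar
```

With this convention, a `relative_gap` of roughly 0.05 corresponds to the "about 5%" difference observed for EfficientNet and MobileNetV2, whereas a layer that separates styles well would give a value much closer to 1.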

In terms of network structure, VGG, EfficientNet, and MobileNet all consist of convolutions with batch normalization and ReLU in between, so the building blocks are the same. Which design decision is unique to VGG, then, so that its Gramian Matrix captures the style while EfficientNet's and MobileNet's do not?
