Reputation: 1617
"Obviously!", you might say... But there's one significant difference that I have trouble explaining by the difference in random initialization.
Take the two pre-trained basenets (before the average pooling layer) and feed them with the same image, you will notice that the output features don't follow the same distribution. Specifically, TensorFlow's backbone has more inhibited features by the ReLU compared to Pytorch's backbone. Additionally, as shows in the third figure, the dynamic range is different between the two frameworks.
Of course, this difference is absorbed by the dense layer addressing the classification task, but: Can that difference be explained by randomness in the training process? Or training time? Or is there something else that would explain the difference?
Code to reproduce:
import imageio
import numpy as np
import tensorflow as tf
import torch
import torchvision

# Load the image and rescale it to [0, 1]
image = imageio.imread("/tmp/image.png").astype(np.float32) / 255

# TensorFlow: ResNet50 backbone without the classification head
inputs = image[np.newaxis]
model = tf.keras.applications.ResNet50(include_top=False, input_shape=(None, None, 3))
output = model(inputs).numpy()
print(f"TensorFlow features range: [{np.min(output):.02f};{np.max(output):.02f}]")

# PyTorch: keep the children up to (and excluding) the average pooling layer
model = torch.nn.Sequential(*list(torchvision.models.resnet50(pretrained=True).children())[0:8])
model.eval()  # inference mode, so batch norm uses its running statistics (as Keras does by default)
inputs = torch.tensor(image).permute(2, 0, 1).unsqueeze(0)  # HWC -> NCHW
output = model(inputs).detach().permute(0, 2, 3, 1).numpy()
print(f"Pytorch features range: [{np.min(output):.02f};{np.max(output):.02f}]")
Output:
TensorFlow features range: [0.00;25.98]
Pytorch features range: [0.00;12.00]
Note: I get similar results with any input image.
Upvotes: 6
Views: 1977
Reputation: 2326
There are two differences between the implementations of ResNet50 in TensorFlow and PyTorch that I noticed and that might explain your observation.
Batch normalization does not have the same momentum in the two. It's 0.1 in PyTorch and 0.01 in TensorFlow (TensorFlow reports it as 0.99, but the two frameworks define momentum in opposite ways, so I am writing it in PyTorch's convention for comparison; see the sketch below). This might affect training and therefore the weights.
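For illustration, here is a minimal sketch of the two conventions, checking the library defaults (which, as far as I can tell, are the ones both ResNet50 implementations use):

import tensorflow as tf
import torch

# PyTorch convention: running = (1 - momentum) * running + momentum * batch
print(torch.nn.BatchNorm2d(64).momentum)              # 0.1
# Keras convention:   moving  = momentum * moving + (1 - momentum) * batch
print(tf.keras.layers.BatchNormalization().momentum)  # 0.99
# Keras' 0.99 is therefore 0.01 in PyTorch's convention: the running
# statistics adapt ten times more slowly in the TF ResNet50.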
TensorFlow's implementation uses biases in convolutions while PyTorch's doesn't (as can be seen in the conv3x3 and conv1x1 definitions). Because the batch normalization layers are affine, the biases are not needed and are spurious: a constant bias added before a batch norm is subtracted right back out with the mean, so BN's own beta offset can do the same job. I think this is truly what explains the difference in your case, since the conv biases can be compensated by the batch norm and can therefore grow arbitrarily large, which would be why you observe a bigger range for TF.
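You can check the bias difference directly on the pretrained models; a minimal sketch:

import tensorflow as tf
import torch
import torchvision

tf_model = tf.keras.applications.ResNet50(include_top=False)
pt_model = torchvision.models.resnet50(pretrained=True)

# Keras ResNet50: every convolution carries a bias term
tf_convs = [l for l in tf_model.layers if isinstance(l, tf.keras.layers.Conv2D)]
print(all(l.use_bias for l in tf_convs))      # True

# torchvision ResNet50: convolutions are bias-free (BN provides the shift)
pt_convs = [m for m in pt_model.modules() if isinstance(m, torch.nn.Conv2d)]
print(all(m.bias is None for m in pt_convs))  # True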
Another way to see this is to compare the summaries as I did in this colab.
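In case you don't want to open the colab, a minimal local equivalent of that comparison is to print both models:

import tensorflow as tf
import torchvision

# Keras lists every layer with its parameter count (conv biases included)
tf.keras.applications.ResNet50(include_top=False).summary()
# torchvision prints the module tree; note bias=False on every Conv2d
print(torchvision.models.resnet50(pretrained=True))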
I currently have a PR that should fix the bias part (or at least provide the possibility to train a ResNet without conv biases in TF), and I plan on submitting one for the BN momentum soon.
I have actually found more differences since, which I list in a paper I recently wrote: you can check them in Table 3 of Appendix F, including those that might have an impact on the output feature statistics.
Upvotes: 2
Reputation: 305
Keras and PyTorch differ significantly in how standard deep learning models are defined, modified, trained, evaluated, and exported. For some parts it's purely a matter of API conventions; for others, fundamental differences between levels of abstraction are involved.
Keras operates at a much higher level of abstraction: it is much more plug-and-play and typically more succinct, but at the cost of flexibility. PyTorch provides more explicit and detailed code, which in most cases means more debuggable and flexible code, with only a small overhead. Training in particular is far more verbose in PyTorch; it hurts, but at times that verbosity buys you a lot of flexibility.
Apart from this, the way the same network is set up differs between the two. In Keras, a classification network typically ends with a built-in softmax and predicts probabilities, and the built-in cross-entropy losses assume probabilities by default (logits are supported via from_logits=True). In PyTorch we have more freedom, but the preferred way is to return logits. This is done for numerical reasons: performing softmax and then log-loss means doing unnecessary log(exp(x)) operations. So, instead of using softmax, we use LogSoftmax (with NLLLoss), or combine the two into the single nn.CrossEntropyLoss loss function, as sketched below.
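A minimal sketch of the two equivalent formulations in PyTorch, on random toy data:

import torch
import torch.nn as nn

logits = torch.randn(4, 10)           # raw, unnormalized network outputs
targets = torch.randint(0, 10, (4,))  # ground-truth class indices

# Option 1: explicit LogSoftmax followed by the negative log-likelihood loss
loss1 = nn.NLLLoss()(nn.LogSoftmax(dim=1)(logits), targets)

# Option 2: the fused, numerically stabler nn.CrossEntropyLoss
loss2 = nn.CrossEntropyLoss()(logits, targets)

print(torch.allclose(loss1, loss2))   # True: the two formulations agree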
Upvotes: -1